The five most important points to consider when developing a data lake
Many organizations possess a wealth of historical data: from customer data to financial data, and from research data to social media data. Often this data is locked in different systems and applications. A classic IT challenge is how to combine this data into centralized insights. But how do you make better choices based on those insights? And how do you achieve greater efficiency?
A frequently used solution for combining data sources is to develop a data lake. A data lake is a form of data storage in which data from multiple sources is initially stored in raw form. Based on this raw data, the data can be enriched, combined analyses can be performed, or processes that use the combined data can be connected. Data lakes are often cited as accelerators in areas such as data science and machine learning. Despite the appealing idea of having all data centralized in one place, it has become clear in recent years that data lakes are not magical solutions. Analyst firms such as Gartner and Forrester have confirmed for several years that a striking number of implementations have failed, in whole or in part, due to wrong expectations or wrong choices 1).
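The raw-first workflow described above can be sketched with a minimal example. Everything here is hypothetical: a local directory stands in for real object storage (such as S3), and the folder layout, file name and record fields are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical local folder standing in for object storage (e.g. S3).
lake = Path("lake/raw/crm")
lake.mkdir(parents=True, exist_ok=True)

# Ingest: store the source record as-is, without forcing a schema on it.
record = {"customer_id": 42, "name": "Acme", "notes": "called on 2024-01-05"}
(lake / "customer_42.json").write_text(json.dumps(record))

# Later, a consumer reads the raw data and applies its own structure,
# enriching it for a specific use case.
raw = json.loads((lake / "customer_42.json").read_text())
enriched = {**raw, "name_upper": raw["name"].upper()}
print(enriched["name_upper"])  # → ACME
```

The point of the pattern is that ingestion stays cheap and generic, while interpretation is deferred to each consumer.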
At Luminis, we encounter this in the market as well. Implementation problems often arise from incorrect expectations combined with too much focus on technology. Drawing on both theory and practice, we take you through some of the considerations, constraints, opportunities and risks. We also outline alternative solutions to a data lake. This puts you in a better position to make a choice, and you can take the right steps toward implementing your data strategy successfully.
The start of the 'big-data' era
Until 2005, by far the most widely used form of storage was the relational database. For more than 30 years, relational databases covered most development and user needs just fine. But with the rise of the Internet in particular, the amount of data generated grew rapidly, and there was a growing need to analyze these large volumes of data, for example for websites with many users or many interactions. The emergence of techniques such as NoSQL and Bigtable therefore marked the start of the 'big data' era around 2005. Relational databases are very good at processing transactions, also known as Online Transaction Processing (OLTP). New storage methods such as Bigtable and Hadoop are, due to their structure and architecture, much better suited for storing, processing and analyzing big data, also known as Online Analytical Processing (OLAP).
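The contrast between the two workload styles can be illustrated with a small sketch. The table and values are made up, and an in-memory SQLite database merely stands in for both a transactional and an analytical engine:

```python
import sqlite3

# In-memory database; illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# OLTP-style workload: many small, individual transactions,
# each touching one row.
for i, amount in enumerate([10.0, 25.5, 7.25], start=1):
    with con:  # each insert commits as its own transaction
        con.execute("INSERT INTO orders VALUES (?, ?)", (i, amount))

# OLAP-style workload: one analytical query scanning all rows at once.
total, = con.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)  # → 42.75
```

Transactional systems are optimized for the first pattern; big-data platforms are built around the second, where queries scan and aggregate large volumes of data.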
In addition to back-office applications, a data warehouse was increasingly set up for analysis purposes, as most enterprise applications are not suitable for large-scale and complex analyses. Data from applications (finance, logistics, CRM, etc.) comes together in a data warehouse, where it is stored in a structured way: as much as possible in structures such as tables, with the right metadata, according to strict definitions, and traceable to the source. The growth in the number of data warehouses also meant that more and more organizations created Business Intelligence roles or departments.
The new term 'data lake'
In 2010, James Dixon, CTO of analytics platform Pentaho, coined the new term "data lake". He used a metaphor: he compared the structured data in data warehouses to bottles of water, all of a standardized size and quality, and ready for use. For some applications this is fine, but for others bottled water is not useful at all. For data science or exploratory research, it can be useful to have access to a large body of water, without the limitations of a bottle. Such a lake of data, a data lake, gives users direct access to the raw and unstructured data, so that they can invent their own applications with it. The mandatory data schemas within a data warehouse are a key difference from data lakes, which fundamentally work with raw data.
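Dixon's distinction is often summarized as schema-on-write (the warehouse enforces structure when data arrives) versus schema-on-read (the lake stores raw data and each consumer applies structure when reading). A minimal sketch of schema-on-read, with made-up records from two sources whose shapes do not quite match:

```python
import json

# Raw "lake" records from two sources, with inconsistent shapes:
# one has clicks as a number, the other as a string with an extra field.
raw_records = [
    '{"user": "ann", "clicks": 3}',
    '{"user": "bob", "clicks": "5", "country": "NL"}',
]

def read_clicks(line: str) -> int:
    """Schema-on-read: the consumer decides how to interpret raw data."""
    rec = json.loads(line)
    return int(rec.get("clicks", 0))  # coerce types at read time

print(sum(read_clicks(r) for r in raw_records))  # → 8
```

A warehouse would instead reject or transform the inconsistent record at load time; the lake accepts both and leaves the interpretation to each reader.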
Ten years later, the term data lake has become an architectural pattern that is regularly used within organizations. An entire ecosystem of companies providing data lake components or turnkey data lakes has emerged.
The future for your organization
With the emergence of cloud infrastructures from 2010 onwards, more and more data lakes are hosted in the cloud or developed on the basis of cloud services. The expectation is that eventually all data lakes will run in the cloud. In addition to low start-up costs and virtually unlimited storage, cloud infrastructures also offer large processing and computing capacity, for example for machine learning applications.
As mentioned earlier, implementation problems often arise from wrong expectations combined with too much focus on technology. It is therefore important to be well supported in weighing the considerations, preconditions, opportunities and risks, and in exploring alternative solutions. Need help? Please contact us.