The five most important points to consider when developing a data lake

Many organizations possess a wealth of historical data, from customer and financial data to research and social media data. Often this data is locked in different systems and applications. A classic IT issue is how to combine this data into centralized insights. But how do you make better choices based on these insights? And how do you achieve greater efficiency?

Data lake development

IN SHORT

Many organizations have a wealth of historical data spread across different systems and applications. How do you combine this data into valuable insights?

A common way to combine data sources is to develop a data lake.

Data lakes are often cited as accelerators in areas such as data science and machine learning. Yet a data lake does not always deliver, often because of wrong expectations.

A frequently used solution for combining data sources is to develop a data lake. A data lake is a form of data storage in which data from multiple sources is initially stored in raw form. Based on this raw data, data can be enriched, combined analyses can be performed, or processes that use the combined data can be connected. Data lakes are often cited as accelerators in areas such as data science and machine learning. Despite the appealing idea of having all data centralized in one place, it has become clear in recent years that data lakes are not magical solutions. Analyst firms such as Gartner and Forrester have confirmed for a number of years that a striking number of implementations have failed, in whole or in part, due to wrong expectations or wrong choices (1).
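The idea of storing data in raw form first and combining it only later can be sketched in a few lines of Python. This is a minimal illustration, not a reference architecture: the folder layout (raw/<source>/), the source names crm and finance, the file names, and the fields customer_id, name and amount are all assumptions made up for this example.

```python
import csv
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: str) -> Path:
    """Store a payload exactly as received under raw/<source>/ (hypothetical layout)."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(payload, encoding="utf-8")
    return target


def combine_revenue_per_customer(lake_root: Path) -> dict[str, dict]:
    """Read both raw files and join them on customer_id at analysis time."""
    crm_path = lake_root / "raw" / "crm" / "customers.json"
    finance_path = lake_root / "raw" / "finance" / "invoices.csv"

    customers = json.loads(crm_path.read_text(encoding="utf-8"))
    combined = {c["customer_id"]: {"name": c["name"], "revenue": 0.0} for c in customers}

    with finance_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Invoices for unknown customers are skipped here; the raw file keeps them.
            if row["customer_id"] in combined:
                combined[row["customer_id"]]["revenue"] += float(row["amount"])
    return combined


def demo() -> dict[str, dict]:
    """Land two sources in their own formats, then combine them."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        land_raw(lake, "crm", "customers.json", json.dumps([
            {"customer_id": "c1", "name": "Acme"},
            {"customer_id": "c2", "name": "Globex"},
        ]))
        land_raw(lake, "finance", "invoices.csv",
                 "customer_id,amount\nc1,100.50\nc1,49.50\nc2,10\n")
        return combine_revenue_per_customer(lake)


if __name__ == "__main__":
    print(demo())
```

Note that nothing is cleaned or reshaped when the data lands. All interpretation happens in the combining step, which is where expectations about data quality tend to surface.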

At Luminis, we encounter this in the market as well. Implementation problems often arise from incorrect expectations combined with too much focus on the technology. Drawing on both theory and practice, we take you through some of the considerations, constraints, opportunities and risks. We also outline alternatives to a data lake. This puts you in a better position to make a choice and to take the right steps toward successfully implementing your data strategy.

"Analyst firms such as Gartner and Forrester have been confirming for a number of years that a striking number of implementations have failed in whole or in part because of wrong expectations or wrong choices."

The start of the 'big-data' era

Until 2005, relational databases were by far the most widely used form of storage. For more than 30 years, they met most development and user needs just fine. With the rise of the Internet, however, the amount of data generated grew rapidly, for example on websites with many users or many interactions, and so did the need to analyze these large amounts of data. The emergence of technologies such as NoSQL and Bigtable therefore marked the start of the 'big data' era in 2005. Relational databases are very good at processing transactions, also called Online Transaction Processing (OLTP). New storage methods such as Bigtable and Hadoop are, due to their structure and architecture, much better suited for storing, processing and analyzing big data. This kind of large-scale analysis is also called Online Analytical Processing (OLAP).
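The difference between processing transactions and performing analysis can be shown on a very small scale. The sketch below uses Python's built-in sqlite3 module purely for illustration. The orders table and its columns are hypothetical, and a single in-memory database says nothing about how Bigtable or Hadoop behave at scale. It only contrasts the two kinds of workload: many small writes versus one query that scans and aggregates everything.

```python
import sqlite3


def demo() -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )

    # Transactional workload: small writes, committed together as one transaction.
    with conn:
        conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("acme", 20.0))
        conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("acme", 30.0))
        conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("globex", 5.0))

    # Analytical workload: one query that reads every row and aggregates.
    rows = conn.execute(
        "SELECT customer, COUNT(*), SUM(amount) "
        "FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for customer, order_count, total in demo():
        print(customer, order_count, total)
```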

In addition to back-office applications, organizations increasingly set up a data warehouse for analysis purposes, as most enterprise applications are not suitable for large-scale and complex analyses. Data from applications (finance, logistics, CRM, etc.) comes together in a data warehouse, where it is stored in a structured way. In data warehouses, data is stored as much as possible in structures such as tables, with the right metadata, according to strict definitions and traceable to the source. The growth in the number of data warehouses also meant that more and more organizations created Business Intelligence roles or departments.

The new term 'data lake'

In 2010, James Dixon, CTO of analytics platform Pentaho, coined the term "data lake". He explained it with a metaphor, comparing the structured data in data warehouses to bottles of water: all of a standardized size and quality, and ready for use. For some applications this is fine, but for others bottled water is not useful at all. For data science or exploratory research, it can be useful to have access to a large body of water, without the limitations of a bottle. Such a lake of data - or data lake - provides users with direct access to the raw and unstructured data, so that they can devise their own applications for it. The mandatory data schemas within a data warehouse are a key difference from data lakes, which fundamentally work with raw data.
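The contrast between a mandatory schema and raw storage can be made concrete with a small Python sketch. It is an illustration under assumptions, not how any particular warehouse or lake product works: the schema WAREHOUSE_SCHEMA, the helper names and the event fields are all hypothetical. The warehouse side validates a record before storing it, while the lake side stores every payload untouched and lets each consumer pick the fields it needs when reading.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
WAREHOUSE_SCHEMA = {"customer_id": str, "amount": float}


def warehouse_insert(table: list[dict], record: dict) -> None:
    """Validate before storing and reject anything that does not fit the schema."""
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected in WAREHOUSE_SCHEMA.items():
        if not isinstance(record[column], expected):
            raise ValueError(f"{column} must be {expected.__name__}")
    table.append(record)


def lake_store(raw_zone: list[str], payload: str) -> None:
    """Keep the payload untouched, whatever its shape."""
    raw_zone.append(payload)


def lake_read(raw_zone: list[str], fields: list[str]) -> list[dict]:
    """Apply a consumer-specific view of the data only at read time."""
    rows = []
    for payload in raw_zone:
        try:
            doc = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not usable for this consumer; the raw payload stays stored
        if isinstance(doc, dict):
            rows.append({name: doc.get(name) for name in fields})
    return rows


def demo() -> tuple[int, int, list[dict]]:
    table: list[dict] = []
    raw_zone: list[str] = []
    events = [
        {"customer_id": "c1", "amount": 12.5},
        {"customer_id": "c2", "amount": "n/a", "channel": "web"},  # does not fit the schema
    ]
    rejected = 0
    for event in events:
        lake_store(raw_zone, json.dumps(event))
        try:
            warehouse_insert(table, event)
        except ValueError:
            rejected += 1
    return len(table), rejected, lake_read(raw_zone, ["customer_id", "channel"])


if __name__ == "__main__":
    print(demo())
```

The second event never reaches the warehouse table, yet it remains available in the raw zone for a consumer that cares about the channel field. The flip side is that every consumer of the lake has to deal with missing or malformed values itself.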

Ten years later, the term data lake has become an architectural pattern that is regularly used within organizations. An entire ecosystem of companies providing data lake components or turnkey data lakes has emerged.

The future for your organization

With the emergence of cloud infrastructures starting in 2010, more and more data lakes are hosted in the cloud, or developed based on cloud services. It is expected that eventually all data lakes will run in the cloud. Cloud infrastructures, in addition to low start-up costs and virtually unlimited storage, also offer large processing and computing capacity for machine learning applications, for example.

As mentioned earlier, implementation problems often arise from wrong expectations combined with too much focus on technology. It therefore pays to get good support in weighing the considerations, preconditions, opportunities and risks. Laying out an alternative solution is also of great importance. Need help? Please contact us.

Development of data lakes

TIPS

New storage methods such as Bigtable and Hadoop are, due to their structure and architecture, much better suited for storing, processing and analyzing big data. This kind of large-scale analysis is also known as Online Analytical Processing (OLAP).

Most enterprise applications are not suitable for large-scale and complex analyses. Increasingly, a data warehouse is set up for analysis purposes: structured storage in tables, with the right metadata, according to strict definitions and traceable to the source. This has driven growth in the number of data warehouses and in BI roles and departments.

It is expected that eventually all data lakes will run in the cloud. Cloud infrastructures, in addition to low start-up costs and virtually unlimited storage, also offer large processing and computing capacity for machine learning applications, for example.

Luminis

Luminis is a group of companies, headquartered in the Netherlands, that specializes in providing innovative solutions to business and government, primarily using emerging (information) technology.
