A Data Lake is a persistent storage area for raw files. It can hold a wide variety of file types and makes them available for further processing. It offers a good deal of flexibility, but it is not necessarily a shortcut to effective analytics.


Why use a Data Lake?

A Data Lake lets you defer certain decisions about how data should be modelled and processed, decisions that a Data Warehouse would require you to make before loading. It is a central point of storage that makes data available for possible future use and analytics.

Does a Data Lake Speed up Analytics?

Although a Data Lake makes it faster to get data into storage, that does not necessarily make it any faster to derive insight. Whether you are using a Data Warehouse or a Data Lake, the challenges of analytics remain the same, and you will always have to account for:

  1. Getting agreement and buy-in across the business on the definition of business terms.
  2. Establishing the automatic reshaping of the data to adhere to these definitions.
  3. Efficiently exploring new potential data sets.

Whilst a Data Lake can provide convenient access to data for exploration, time and human thinking must ultimately be applied in order to derive value from that data. Driving action from data requires a publishing platform, such as a Data Warehouse, through which multiple people can consume data insights.

When is a Data Lake Effective?

A Data Lake is useful for raw data discovery. When your organisation has skilled data practitioners, they can “audition” new data sets to answer particular business questions, all from a single data repository. For example, a company stores all of its call centre recordings as audio files in a Data Lake for future analysis. A marketing initiative is later launched which calls for converting the audio files to text and generating a sentiment score for each. The audio files in the Data Lake can be used for this purpose straight away, without first having been processed and prepared for consumption within a Data Warehouse.

Another use for a Data Lake is as a staging area for machine-generated data. Internet of Things devices generate large volumes of data, the utility of which may not be immediately apparent. A Data Lake can store these logs so that they can be processed in future, for instance to build a predictive model that anticipates outages.

Additionally, a Data Lake can be used as a staging area for a Cloud Data Warehouse, as required by many modern cloud data platforms. It can also serve as a data archive, offloading data from a Data Warehouse so that historical data is retained while overall query times and compute costs improve.

When is a Data Lake not Effective?

A Data Lake has some inherent limitations that make it a poor fit for certain scenarios. One of these is when an audit trail is required to track changes at the row level and those changes are not stored in a source system. This can be important for compliance purposes, for machine learning initiatives that rely on the history of a particular business process, and for a number of other business reasons. A Data Lake is typically not the best way to achieve it. Although some offerings attempt to replicate this functionality in a Data Lake, it is important to know whether they are officially supported.

Furthermore, it is important to understand that Data Lakes do not support traditional analytics consumption patterns such as those provided by Data Warehouse Platforms. A Data Lake does not typically support a Live Connection or Direct Query to the underlying data from a visualisation tool. There are some exceptions to this, including technologies such as PolyBase and Serverless SQL queries.

Query performance is typically slower than on a Data Warehouse Platform, and support is limited both for concurrent queries and for row-level security (which dynamically filters what a data consumer can see). Finally, it is more challenging to find people with the appropriate skillset to manage Data Lake technology, whereas the SQL skills used for Data Warehouses are far more widespread.

What to Consider When Setting Up a Data Lake

Ironically, many of the considerations involved in successfully implementing a Data Lake involve mimicking the behaviour and functionality of Data Warehouse Platforms. Firstly, the flat files used by a Data Lake do not typically store metadata (unless they are structured documents such as JSON), so there is a higher risk of technical metadata being lost. A process is usually implemented to compensate for this, either by creating additional files that hold the metadata or by loading the data into the Parquet file format.

Parquet has the additional benefit of storing data in a columnar format optimised for querying, which provides the much faster query times typically associated with Data Warehouses. The trade-off is that converting data to Parquet involves significantly longer initial load times.
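The "additional files" approach to preserving technical metadata can be illustrated with a small sketch. The Python below is only one possible way to do it, using the standard library: it writes a CSV flat file and a JSON "sidecar" file next to it that records the column names and source data types. The function names, the `.schema.json` suffix, the example columns and the type names are all assumptions made for illustration and are not taken from any particular product.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_with_sidecar(rows, schema, data_path):
    """Write rows to a CSV file plus a JSON sidecar describing its columns.

    `schema` maps column name -> source data type (hypothetical type names).
    """
    data_path = Path(data_path)
    columns = list(schema)

    # The flat file itself: column names only, no type information.
    with data_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)

    # The sidecar keeps the technical metadata the CSV cannot carry.
    sidecar_path = data_path.with_name(data_path.name + ".schema.json")
    sidecar = {
        "file": data_path.name,
        "row_count": len(rows),
        "columns": [{"name": c, "type": schema[c]} for c in columns],
    }
    sidecar_path.write_text(json.dumps(sidecar, indent=2), encoding="utf-8")
    return sidecar_path


def read_sidecar(data_path):
    """Load the sidecar metadata that belongs to a given data file."""
    data_path = Path(data_path)
    sidecar_path = data_path.with_name(data_path.name + ".schema.json")
    return json.loads(sidecar_path.read_text(encoding="utf-8"))


# Illustrative data only: the table, columns, and types are made up.
example_schema = {
    "call_id": "int",
    "started_at": "datetime",
    "duration_s": "decimal(8,2)",
}
example_rows = [
    {"call_id": 1, "started_at": "2024-01-05T09:30:00", "duration_s": "184.50"},
    {"call_id": 2, "started_at": "2024-01-05T09:42:10", "duration_s": "61.00"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "calls.csv"
    write_with_sidecar(example_rows, example_schema, csv_path)
    recovered = read_sidecar(csv_path)
    print(recovered["columns"])
```

A later load into a Data Warehouse could read the sidecar to recreate column types rather than guessing them from the text in the CSV. A format such as Parquet avoids the need for a separate file because the schema is stored within the data file itself.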

Finally, it is important to consider the fact that a large file storage area doesn’t automatically provide context for the data that is placed within it. For best results, some thought needs to be given in advance, ideally by a Data Steward, as to how the data will be logically separated. This involves understanding data sources, planning folder structure and knowing each file format. This is important in setting up the data sets so they can be drawn from with confidence in the future.
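As a rough illustration of planning a folder structure in advance, the sketch below builds a predictable path for each file from its zone, source system, data set and load date. The layout (zone, then source, then data set, then `year=/month=/day=` folders) is one common convention and is an assumption here, as are the function name `lake_path` and the example names. A Data Steward might well choose a different scheme.

```python
from datetime import date
from pathlib import PurePosixPath


def lake_path(zone, source, dataset, load_date, filename):
    """Build a predictable object path for a file landing in the lake."""
    # Reject empty segments or segments that would silently add folder levels.
    for part in (zone, source, dataset, filename):
        if not part or "/" in part:
            raise ValueError(f"invalid path segment: {part!r}")
    return PurePosixPath(
        zone,
        source,
        dataset,
        f"year={load_date:%Y}",
        f"month={load_date:%m}",
        f"day={load_date:%d}",
        filename,
    )


# Hypothetical example: a call centre recording landed on 7 March 2024.
path = lake_path("raw", "call_centre", "recordings", date(2024, 3, 7), "call_0001.wav")
print(path)
```

The example prints `raw/call_centre/recordings/year=2024/month=03/day=07/call_0001.wav`. Generating paths through a single function, rather than by hand, keeps the context of every file (where it came from and when it arrived) readable from its location alone.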

How can Loome Help?

Loome Integrate can automate the generation of a Data Lake while preserving technical metadata, and can load the data into a Data Warehouse within the same pipeline. It allows you to get started immediately rather than waiting weeks or months to set up a Data Lake platform. Furthermore, Loome can set up the staging area of a cloud data platform while maintaining the important source system metadata.

Relevant Connectors

Azure Data Lake Gen2

Parquet

Amazon S3

Azure Blob Storage