Understanding the difference between ETL and ELT, and how each is utilised in a modern data platform, is important for getting the best outcomes from your Data Warehouse.
The difference between ETL and ELT has created an ongoing debate as to which is the optimal choice for enterprise data storage and analytics. The discourse has shifted back and forth, affected by changes in data platform technology and reductions in processing constraints. The distinction comes down to the order in which Transformation and Loading occur, which can make a difference depending on the data tasks being performed.
What is ETL?
Extract, Transform and Load (ETL) refers to the process of transforming data before it is loaded into a data warehouse environment. The defining feature of an ETL approach is that data is typically processed in-memory rather than in-database.
What are the Benefits of an ETL Pipeline?
The arguments for ETL have traditionally focused on the storage cost and available resources of an existing data warehouse infrastructure. In-memory processing allowed you to set up a data pipeline that was not constrained by disk throughput, relying on in-memory compute resources instead. It also offered the advantage of efficiently conducting advanced processing using programmatic operations where “set-based” operations didn’t address some transformation needs.
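As a rough illustration of the ETL ordering, the sketch below cleans rows in memory with ordinary Python and only then loads the result. It is a minimal example under stated assumptions, not a production pipeline: the source rows, the `orders` table and the validation rule are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Hypothetical rows standing in for an extract from a source system.
SOURCE_ROWS = [
    {"order_id": 1, "amount": "19.99", "country": "au"},
    {"order_id": 2, "amount": "5.00", "country": "nz"},
    {"order_id": 3, "amount": "not-a-number", "country": "au"},
]


def transform(rows):
    """Clean rows in memory, before anything reaches the warehouse."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        cleaned.append((row["order_id"], amount, row["country"].upper()))
    return cleaned


def run_etl(rows):
    """Extract (rows are passed in), transform in memory, then load."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite as the warehouse
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transform(rows))
    conn.commit()
    return conn


conn = run_etl(SOURCE_ROWS)
loaded = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

Note that the warehouse only ever sees the cleaned rows; the invalid record is discarded before loading and cannot be recovered from the target.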
What is ELT?
Extract, Load and Transform (ELT) refers to the process of extracting data from source systems, loading it into the Data Warehouse environment and then transforming it afterwards using in-database operations such as SQL. It relies on having the capacity to initially store large volumes of raw data.
What are the Benefits of an ELT Pipeline?
ELT offers a number of advantages when setting up a data and analytics pipeline, particularly in terms of scalability. Cloud-based compute capacity can be used in a highly elastic way, purchased on demand by the hour or minute, with resources allocated dynamically to prevent bottlenecks. Cloud-based Massively Parallel Processing (MPP) Data Warehouse Platforms also offer a highly scalable way to overcome traditional disk throughput limitations.
SQL tends to cater very well for the majority of analytics transformations, and for more complex processing many database technologies support in-database Python or R. Another advantage of ELT is that the data is already loaded into the presentation layer, so there is no extra latency from loading the data after transformation.
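To make the ELT ordering concrete, the following sketch loads raw values untouched and then transforms them with SQL inside the database. It is illustrative only: the `raw_orders` and `orders` tables and the simple numeric check are hypothetical, and an in-memory SQLite database stands in for a cloud data warehouse.

```python
import sqlite3

# Hypothetical raw extract; every value is loaded exactly as received.
RAW_ROWS = [
    (1, "19.99", "au"),
    (2, "5.00", "nz"),
    (3, "not-a-number", "au"),
]


def run_elt(rows):
    conn = sqlite3.connect(":memory:")  # assumption: SQLite as the warehouse

    # Load: land the raw data first, with no cleaning applied.
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, amount TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

    # Transform: a set-based SQL statement run inside the database.
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT order_id,
               CAST(amount AS REAL) AS amount,
               UPPER(country)       AS country
        FROM raw_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'  -- keep only decimal-looking values
        """
    )
    conn.commit()
    return conn


conn = run_elt(RAW_ROWS)
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
orders = conn.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

Unlike a transform-first pipeline, the raw table still holds all three rows after the transformation, so the logic can be changed and re-run without going back to the source.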
What is ELTL?
With increasing volumes and variety of data, additional options are available, such as ELTL (Extract, Load, Transform, Load). This involves loading the data immediately into scalable low-cost storage, after which it is transformed and loaded again into a more advanced presentation layer. This is a useful option if you have a large variety of data sources which will be utilised for a range of purposes. A Data Lake can be established for data discovery and data science, with a different segment of the data being processed, presented and made to conform in a traditional Data Warehouse format. It is also a useful approach when you need to pre-process data to conform with a tabular structure for loading into a data warehouse.
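The two loads in ELTL can be sketched as follows. This is a simplified example with assumed names throughout: a temporary directory plays the role of low-cost storage, the nested event records and the `events` table are hypothetical, and an in-memory SQLite database stands in for the presentation layer.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw(records, lake_dir):
    """First load: write records untouched, as JSON lines, to cheap storage."""
    path = Path(lake_dir) / "events.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path


def transform(path):
    """Flatten nested JSON into tabular rows a warehouse table can accept."""
    rows = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            rows.append(
                (rec["id"], rec["user"]["name"], len(rec.get("tags", [])))
            )
    return rows


def load_warehouse(rows):
    """Second load: insert the conformed rows into the presentation layer."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite as the warehouse
    conn.execute(
        "CREATE TABLE events (id INTEGER, user_name TEXT, tag_count INTEGER)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Hypothetical semi-structured records that do not fit a table as-is.
records = [
    {"id": 1, "user": {"name": "ana"}, "tags": ["a", "b"]},
    {"id": 2, "user": {"name": "ben"}},
]

with tempfile.TemporaryDirectory() as lake_dir:
    raw_path = land_raw(records, lake_dir)
    conn = load_warehouse(transform(raw_path))

events = conn.execute("SELECT * FROM events ORDER BY id").fetchall()
```

The raw files remain available in their original shape for discovery work, while the warehouse receives only the flattened, tabular subset.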
What is ELLT?
A variation of ELT, ELLT (Extract, Load, Load, Transform) extracts data from source systems and loads it into low-cost storage, after which it is transferred to a staging area of a cloud data warehouse. At this point the data is transformed to conform to a typical data warehouse data model. ELLT is useful when your Data Warehouse target is a cloud data warehouse that can scale easily, such as Azure SQL DW, Snowflake, Google BigQuery or Amazon Redshift.
What is EL now, T later?
Also referred to as “Late Binding”, this can be summarised as the philosophy of loading a data lake with raw data and only conforming the data at the time a query is run. This typically presents some data governance challenges, which is a topic for another article (read Why a Data Lake is not a silver bullet for Analytics).
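One simple way to picture late binding is a database view: nothing is conformed when the data lands, and the cleaning logic runs each time the view is queried. The sketch below uses that idea with hypothetical table, column and view names, and an in-memory SQLite database as a stand-in for a data lake query engine; real late-binding platforms work differently in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw landing table: values are stored exactly as they arrived.
conn.execute("CREATE TABLE raw_events (country TEXT, amount TEXT)")

# The view holds only the conforming logic; it stores no data itself.
conn.execute(
    """
    CREATE VIEW sales_by_country AS
    SELECT UPPER(TRIM(country))       AS country,
           SUM(CAST(amount AS REAL))  AS total
    FROM raw_events
    GROUP BY UPPER(TRIM(country))
    """
)

# Data that lands after the view was defined is still picked up,
# because the transformation is applied at query time.
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [(" au", "10.5"), ("AU ", "4.5"), ("nz", "3")],
)

result = conn.execute(
    "SELECT country, total FROM sales_by_country ORDER BY country"
).fetchall()
```

Because every consumer can define its own view over the same raw data, two teams can easily end up with different definitions of the same measure, which is one source of the governance challenges mentioned above.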
What to Consider When Building a Modern Data Platform
When deciding between these different approaches to enterprise data warehouses, there are several questions you should be asking in order to make the right choice. Firstly, if you are implementing a Cloud Data Platform, it is expected that data will be initially landed into a cloud storage area. This would mean that you would have to select between ELTL and ELLT.
Furthermore, the choice depends on what kind of processing the data will need and the ways in which it will be utilised. Set-based operations, as are typically used in SQL-driven data warehouse transformations, are highly efficient but have a ceiling of complexity. If you are looking to implement programmatic operations, you may need to consider using other languages such as Python.
ELT has the additional benefit of reducing complexity, with maintenance and tweaks being easier to implement. The transformation code can be presented as a logical tier of the data warehouse itself, and it is easier to identify issues and implement changes, as opposed to trying to locate a single piece of logic in a Jupyter notebook.
Review the availability of skills in the market and in your team. SQL skills are by far the most plentiful for data engineering and are typically the existing skill set of a data warehouse team.
Ultimately, when trying to determine the most cost-effective solution for your data pipeline, the entire end-to-end process should be considered. For example, when weighing up a Data Warehouse against transforming data using Spark, the three distinct stages to measure are:
- Time to load the data into memory
- Transforming and materialising the output
- Loading the data into a presentation layer suitable for multiple people to query at the same time.
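A small timing harness makes it easier to compare candidates on the whole process rather than on the transformation step alone. The sketch below is a hypothetical helper, not part of any particular tool: `time_stages` and the stage names are assumptions, and the lambdas are trivial placeholders for your real load, transform and present steps.

```python
import time


def time_stages(stages):
    """Run each (name, callable) stage in order and record its duration."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        stage()
        timings[name] = time.perf_counter() - start
    # Record the end-to-end figure alongside the per-stage durations.
    timings["total"] = sum(timings.values())
    return timings


# Placeholder workloads; swap in the real steps of each candidate pipeline.
timings = time_stages(
    [
        ("load", lambda: list(range(10_000))),
        ("transform", lambda: sorted(range(10_000), reverse=True)),
        ("present", lambda: sum(range(10_000))),
    ]
)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}s")
```

Running the same harness against each option, with representative data volumes, gives comparable figures for all three stages as well as the total.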
How Loome Can Help With ELT Software
With over a hundred different connectors, Loome Integrate is an intuitive data pipeline tool which can help you get from source to target regardless of whether you’re using an ETL or an ELT approach. Complete visibility over every source, channel and transformation, together with advanced data task orchestration, gives you what you need to effectively manage your Data Warehouse.