Continuous Integration (CI) and Continuous Deployment (CD) are well known and familiar to application developers. These approaches deliver a faster pace of testing and deployment, allowing for a more Agile workflow and shorter turnaround time for changes and new releases. Increasingly, these methodologies are being applied to creating and maintaining Data Warehouses and Analytics systems.
What are the Benefits of CI/CD for Data Warehouses?
Data platform technology is evolving: platforms are faster, draw on a wider variety of data sources, and face heightened availability expectations from a broader business audience. This has emphasised the benefits that can be achieved by applying CI/CD methodology to the ETL and ELT code used to build and maintain Data Warehouses.
Overall, CI/CD is a means of managing change in a controlled way. Change management is not a new problem; it has always been central to business information systems, but the tools to address it have typically been prohibitive in cost or usability. When applied to a Data Warehouse, effective change management makes the warehouse more adaptive and therefore more useful as a business tool.
Business requirements change, and the data solution needs to change with them to account for more data sources, additional processing, or extensions to an existing solution. An adaptive business model is only as effective as the data systems that keep pace with it.
What are the Challenges in CI/CD for Data Warehouses?
Implementing Continuous Integration and Continuous Deployment for a Data Warehouse is not the same as implementing it in a conventional application development environment: there are many more moving parts, often spanning multiple applications, and the data itself is part of the solution that must be managed.
More Moving Parts
Effective management of Data Warehouse transformation logic is challenging because it spans a pipeline of different technologies, systems and subsystems, linked by complex interdependencies which, if not accounted for holistically, can cause numerous data quality problems.
Aside from the database itself, there are ingestion tools such as Azure Data Factory (ADF), Fivetran, Loome, Matillion and Talend. Furthermore, there are transformation scripts written in SQL, Python or Spark. Finally, there are master data systems (Microsoft SQL Server Master Data Services, Dell MasterHub) and visualisation tools (Power BI, Tableau, Qlik). Each of these has its own methodology and needs to be accounted for in the CI/CD process.
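To make these interdependencies concrete, the sketch below models pipeline stages as a dependency graph and derives a safe execution order. The stage names and the tools in the comments are hypothetical; a real pipeline would span several of the products listed above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each stage maps to the set of stages it depends on.
# The stages deliberately span different tools, which is why dependencies
# need to be tracked explicitly rather than per-product.
pipeline = {
    "ingest_sales":      set(),                          # e.g. an ADF copy activity
    "ingest_customers":  set(),                          # e.g. a Fivetran connector
    "stage_sales":       {"ingest_sales"},               # SQL staging script
    "conform_customers": {"ingest_customers"},           # master data lookup
    "build_fact_sales":  {"stage_sales", "conform_customers"},
    "refresh_dashboard": {"build_fact_sales"},           # e.g. a Power BI refresh
}

# A CI step can validate the graph (TopologicalSorter raises CycleError on
# circular dependencies) and derive the order in which changes must deploy.
print(list(TopologicalSorter(pipeline).static_order()))
```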
The Nature of Data
When developing code for a Data Warehouse, a developer is often blind to the actual conditions under which data will eventually pass through the pipeline. Conventional application development happens in a far more controlled environment, where few surprises can be expected from user inputs and the data model. Managing code in an application is a different problem from managing large volumes of data in a data system, and a poorly thought-out Continuous Deployment process can be counterproductive. Successful CI/CD is a matter of good practice catalysed by effective tools.
The following points illustrate some of the complexity of applying CI/CD to Data Warehousing:
- The main source of data feeding the Data Warehouse is generally not controlled and will change over time, e.g. the data structure of the source system is modified when the application is upgraded (a defensive check for this is sketched after this list).
- As interfaces to source data change, the version of the ETL used to extract the data matters less; it becomes more important to retain the data already residing in the Data Warehouse (as re-extracting it may no longer be possible).
- Deploying data changes over large volumes can be problematic: properly testing such a script requires a copy of production data, at production scale, to execute.
The demand for adaptability and flexibility versus the need to retain access to historical data presents a conflict that needs to be factored in.
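One way to contain the first of these risks is to verify the source schema against the contract the ETL was built for before each load, so an upstream upgrade fails loudly instead of silently corrupting the warehouse. The sketch below uses SQLite as a stand-in source; the table, columns and types are illustrative assumptions.

```python
import sqlite3

# Columns and types the current version of the ETL was written against.
# The table and column names here are illustrative, not from a real system.
EXPECTED_SCHEMA = {"order_id": "INTEGER", "customer_id": "INTEGER",
                   "amount": "REAL", "ordered_at": "TEXT"}

def check_source_schema(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return a list of drift messages; an empty list means the contract holds."""
    actual = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    problems = []
    for col, col_type in EXPECTED_SCHEMA.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col].upper() != col_type:
            problems.append(f"type changed: {col} is now {actual[col]}")
    for col in actual.keys() - EXPECTED_SCHEMA.keys():
        problems.append(f"unexpected new column: {col}")  # may be benign, but flag it
    return problems

# In a pipeline run, a non-empty result would abort the load and alert the
# team rather than letting a silently altered source flow into the warehouse.
```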
The Testing Problem
The two challenges above make testing particularly difficult when managing the Data Warehouse code deployment process. While the logic can be regression tested, the real challenge lies in emulating operational conditions with the source data. Even relatively appropriate test data will typically not be representative of a full operational environment, nor of the loads and situations the pipeline will be exposed to.
A script that runs fine on simulated data will not necessarily do so with operational data. Because of this, setting up a UAT environment for a Data Warehouse pipeline is challenging: establishing a test environment that successfully mirrors production may simply not be feasible, as running tests at that scale can consume huge volumes of time and paid data warehouse resources. Furthermore, setting up test cases is not straightforward either, because of the difficulty of aligning simulation data with the expected result.
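Regression testing the logic itself remains worthwhile even when operational conditions cannot be reproduced. Below is a minimal sketch, assuming an illustrative sales transformation and a hand-built fixture in SQLite:

```python
import sqlite3
import unittest

# The transformation under test; in a real warehouse this SQL would live in
# a version-controlled file. The table and columns are illustrative.
TRANSFORM_SQL = """
    SELECT customer_id, SUM(amount) AS total_spend
    FROM sales
    GROUP BY customer_id
    ORDER BY customer_id
"""

class TransformRegressionTest(unittest.TestCase):
    def setUp(self):
        # A small fixture standing in for staged source data.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
        self.conn.executemany("INSERT INTO sales VALUES (?, ?)",
                              [(1, 10.0), (1, 5.0), (2, 7.5)])

    def test_totals_match_expected(self):
        # Expected output agreed in advance for this fixture.
        self.assertEqual([(1, 15.0), (2, 7.5)],
                         self.conn.execute(TRANSFORM_SQL).fetchall())

if __name__ == "__main__":
    unittest.main()
```

A passing suite like this proves the logic, not production-scale behaviour; the gap between the two is exactly the UAT problem described above.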
How Does Loome Help with Data Warehouse CI/CD?
Loome offers a variety of solutions that help at different stages of developing a Continuous Integration and Continuous Deployment process for Data Warehousing. Loome Integrate provides functionality for data ingestion, staging and transformation, keeping everything within a single, easy-to-manage tool with high interoperability with many commonly used systems.
It also offers native integration with GitHub, Azure DevOps, GitLab and any other Git service, version-controlling all orchestration logic as it moves between test and production environments. An automated workbench allows the creation of regression testing rules, with the ability to easily move tested code between tiers. Furthermore, it enables teams to work directly with the code within GitHub, making full use of features such as configuration management and version control.
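As a generic illustration of that tier promotion (plain Git, not Loome-specific; the branch name and repository path are assumptions), the commit that passed regression testing is merged into the production branch:

```python
import subprocess

def promote(repo_path: str, tested_ref: str) -> None:
    """Merge a tested commit, tag or branch into production and push it."""
    def git(*args: str) -> None:
        subprocess.run(["git", "-C", repo_path, *args], check=True)

    git("checkout", "production")
    git("merge", "--no-ff", tested_ref,
        "-m", f"Promote {tested_ref} to production")
    git("push", "origin", "production")

# promote("/path/to/warehouse-repo", "v1.4-tested")  # hypothetical tag
```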