A Data Warehouse is built by collecting data from different data sources and transforming it into a structure that’s easy to analyse and report on (typically a dimensional model). To build a data warehouse, we need to write code, or use tools, that perform ETL: Extract, Transform and Load.
These processes Extract data from source systems (databases, APIs, files), Transform the data into the model we use in the Data Warehouse (usually a dimensional model), and Load it into the Data Warehouse. Another approach is ELT, in which we Extract data from source systems, Load it as-is into a staging area in the Data Warehouse, and then Transform it by running transformation processes inside the data warehouse.
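The difference between the two approaches is really just the ordering of the same three steps. Here is a minimal sketch contrasting them; the function names and the sample rows are invented for illustration, and plain Python lists stand in for real source systems and warehouse tables:

```python
def extract():
    # Pull raw rows from a source system (database, API, file).
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

def transform(rows):
    # Reshape to the warehouse model (here: just cast a text field).
    return [{"id": r["id"], "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    # Append rows to a target table (a list stands in for the DW).
    target.extend(rows)

# ETL: transform first, then load the finished rows.
dw_etl = []
load(transform(extract()), dw_etl)

# ELT: load as-is to staging, then transform inside the warehouse.
staging = []
load(extract(), staging)
dw_elt = transform(staging)

assert dw_etl == dw_elt  # same end result, different ordering
```

Note that in ELT the raw rows in `staging` survive after the transform runs, which is the basis of the persistence argument made later in this article.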
E, L and T Explained
Let’s look at each stage in more detail:
We build the Data Warehouse by collecting data from various data sources. The data sources can have different shapes: a relational database, a text file, and an API, to name a few. Furthermore, these data sources may be internal to the organisation or external (the Google Analytics API, social media, etc.). The method we use to extract data from a source system depends on how it is stored and who owns it. For example, if we want to extract data from an internal relational database, we write queries. If the source is an external API, such as Google Analytics, we can write a script (in Python or another programming language) that fetches the data through that API.
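As a hedged sketch of the two extraction paths just described: querying an internal relational database, and fetching from an external HTTP API. The table name, sample rows, and API URL below are placeholders, not real systems; sqlite3 stands in for the source database so the example is self-contained:

```python
import sqlite3

# --- Extract from a relational database by writing a query ---
conn = sqlite3.connect(":memory:")  # stand-in for the real source DB
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.99), (2, 25.0)])
rows = conn.execute("SELECT id, total FROM orders").fetchall()
print(rows)  # [(1, 9.99), (2, 25.0)]

# --- Extract from an external API (illustrative only, not executed) ---
# import requests
# resp = requests.get("https://api.example.com/v1/sessions",
#                     params={"start": "2024-01-01", "end": "2024-01-31"})
# payload = resp.json()
```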
Whether we use ETL or ELT, we’ll need to load the extracted data into the Data Warehouse. In ETL, the data we load has already been transformed into the DW model; in ELT, the data is in its original form. Either way, we need to load it into the Data Warehouse. If the volume is large, we should optimise the process, ideally by using the database’s bulk load facilities. The end result of the load process is tables populated with data. In ETL, these are the tables users can start consuming (using visualization products like Power BI or Tableau, or by running database queries). In ELT, these are the tables we use as the source for the transformation processes.
Transforming the data from its source format to the target includes data cleansing, flagging data problems, and populating the target tables. This is not just about taking source data and reshaping it; it is also where we create new data in the form of KPIs. Think about a data source that includes information about blood tests. We transform the structure of the data into a star schema (dimensional model), but at the same time we can already start analysing the data and creating KPIs, such as a probability score for heart issues. In ETL, we can write transformations using ETL products. In ELT, since the data has been loaded to the staging area and is already inside our data warehouse, we often write stored procedures to transform the data. Most relational databases come with a built-in language: Oracle’s PL/SQL, Microsoft’s T-SQL and PostgreSQL’s PL/pgSQL, to name a few.
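To make the blood-test example concrete, here is an ELT-style transform running as SQL inside the warehouse. sqlite3 stands in for the warehouse, and the table names and the risk-score rule are invented purely for illustration (a real clinical score would be far more involved):

```python
import sqlite3

dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE stg_blood_tests (patient_id INTEGER, ldl REAL, hdl REAL);
    INSERT INTO stg_blood_tests VALUES (1, 190.0, 35.0), (2, 90.0, 60.0);

    -- Populate a target table and derive a simple KPI in the same pass.
    CREATE TABLE fact_blood_tests AS
    SELECT patient_id,
           ldl,
           hdl,
           CASE WHEN ldl > 160 AND hdl < 40 THEN 'high' ELSE 'low' END
               AS heart_risk
    FROM stg_blood_tests;
""")
rows = dw.execute(
    "SELECT patient_id, heart_risk FROM fact_blood_tests ORDER BY patient_id"
).fetchall()
print(rows)  # [(1, 'high'), (2, 'low')]
```

In a production warehouse this SELECT would typically live inside a stored procedure (PL/SQL, T-SQL, PL/pgSQL) rather than in application code.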
ETL vs. ELT
I usually prefer the ELT approach. Here’s why:
- We load the source data as-is into a mirror (staging) schema in the DW, which means we keep a persistent copy of past data. As a result, we’ll be able to re-process old data at later stages, even if it is no longer available from the source systems or their APIs.
As an example, think about a daily CSV file you receive from an external provider. Let’s assume it has 35 fields in it. At first, you are only interested in 12 of the fields, but a year later you may realize there are 6 other fields that can be of interest to your users. In the ELT approach, you’ve been keeping a persistent copy of the daily file by loading it into the Data Warehouse as-is, so you can go back and reuse that information. In ETL, however, you haven’t kept a persistent copy, meaning you can only use the other 6 fields from now on.
- ELT allows us to create data transformations by coding stored procedures, as opposed to using an ETL product’s transformation components. This is a benefit because it allows us to implement complex business rules in a programming language. Complex business rules cannot easily be implemented in ETL products, especially not the cheap or open source ones. Also, I’ve found that it’s easier to hire database developers than specialists in a specific ETL product.
- We avoid the high costs of ETL products, as all we need in ELT is a way to copy data from source to target, as-is. You can do that using simpler and cheaper (or free) products, or even by coding the data transfers yourself, using Python and the like.
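The points above can be sketched as a small “load as-is” step: copy every field of the daily file into a staging table without choosing columns up front, so fields you need later are already there. The file name and columns below are invented, an in-memory string stands in for the file, and the staging table is created dynamically from the file header:

```python
import csv
import io
import sqlite3

# Stand-in for the provider's daily CSV (real files would have 35 fields).
daily_csv = io.StringIO("order_id,customer,region\n1,alice,EU\n2,bob,US\n")
reader = csv.reader(daily_csv)
header = next(reader)

dw = sqlite3.connect(":memory:")
# Build the staging table from the header, so ALL fields are kept as-is.
cols = ", ".join(f'"{c}" TEXT' for c in header)
dw.execute(f"CREATE TABLE stg_daily_file ({cols})")
placeholders = ", ".join("?" for _ in header)
dw.executemany(f"INSERT INTO stg_daily_file VALUES ({placeholders})", reader)

n = dw.execute("SELECT COUNT(*) FROM stg_daily_file").fetchone()[0]
print(n)  # 2
```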
There are many ETL/ELT products on the market. These products usually provide a drag-and-drop style UI that enables the developer to connect to various types of sources, move data between source and target, and transform it to a dimensional model. The good ones have a lot of connectors that allow you to connect to any source type you want. They also include cleansing modules and advanced logging options, as well as source control plugins.
These products can be very expensive (Informatica, IBM DataStage), but there are also open source options (Talend, Pentaho Data Integration).
There are also ETL-as-a-service products such as AWS Glue. One of the most popular ETL products is Microsoft’s SSIS (SQL Server Integration Services). It is a bit limited, but it is still very popular due to being part of the Microsoft BI stack.
In the next few days I’ll write another article about a specific product, Pentaho Data Integration (Kettle), which has an open source community edition. It is free and pretty good. See you soon.
If you enjoyed the content of this article, please consider sharing it with your network on LinkedIn. As an independent consultant, this helps my business. Thank you!