ETL Products
In my previous article I talked about ETL and ELT, and today I want to talk briefly about a specific ETL product on the market. This article assumes basic knowledge of what a data warehouse is and what ETL and ELT are. If you are not familiar with these terms, please start by reading this and this.
That product is the open-source, community edition of Pentaho Data Integration, also known as PDI or Kettle.
There are plenty of expensive ETL products on the market from industry leaders like IBM, Oracle and Microsoft, but not many cheap or free options. The ETL product is the one part of a BI solution where, if you are on a small budget, you don’t have many choices.
Open Source ETL Products
If you search for free or cheap options, two products are often mentioned – Talend Open Studio and Pentaho Data Integration.
The problem with both is that they aren’t always easy to configure and start using quickly. It’s not bad, but I’m sure it drives people away in some cases. Both products ask you to install a specific Java version and to configure it properly. Once you’ve configured it correctly, though, these are very strong products. I haven’t yet had the chance to work with Talend Open Studio, but I’ve worked with PDI on three different projects, and I’m very impressed. In fact, I’m much more impressed with PDI than I am with Microsoft’s SSIS, which I’ve used a lot and which costs much more.
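To give you an idea of what that configuration involves, here is a minimal sketch for Linux/Mac (the paths are illustrative, and as far as I know the launch scripts check the PENTAHO_JAVA_HOME environment variable before falling back to JAVA_HOME):

    # Check which Java version is on the path
    java -version

    # Point PDI at a specific JDK (illustrative path)
    export PENTAHO_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

    # Launch Spoon from the folder where PDI was extracted
    ./spoon.sh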
Pentaho Data Integration
Pentaho was founded in 2004 in Florida. The company has built a suite of Business Intelligence products. I’m not a technology historian, but my understanding is that in the beginning they focused on open-source offerings and later started to build premium services on top. I first heard of Pentaho when I interviewed for a BI manager role in 2010, and the company I interviewed with was already using the community edition. I ended up starting a new role elsewhere and didn’t look at Pentaho again until 2016, but that’s beside the point.
In 2015, Pentaho was acquired by Hitachi, and since then it seems like the community edition is being more and more hidden, which is understandable, as Hitachi is trying to maximize revenue.
In the next few sections I’ll briefly go over the product components.
PDI Basics
PDI is written in Java and therefore runs on Unix/Linux/Mac and Windows. It is not just one program, but rather a group of programs responsible for different parts of the ETL solution.
The main program, where ETL development is done, is called “Spoon”. This is the development environment for your ETL processes. It comes with a simple drag-and-drop UI which enables the creation of data and control flows – “transformations” and “jobs”, which will be explained shortly. Once transformations and jobs are designed using Spoon, “Pan” can be used to execute transformations from the command line, and “Kitchen” does the same with jobs.
Spoon
Spoon is used to define data connections and to create data transformations and control flows. Let’s start with data connections.
The “E” and “L” in ETL stand for Extracting data from sources and Loading data into the target. In a typical data warehouse environment we’ll probably want to extract data from various sources and load it into one target – the data warehouse. Spoon includes a large number of connectors, and you can also create custom ones. Some of the main data connectors are: database tables (almost any relational database you can think of is supported), CSV files, spreadsheets, Google Analytics, AWS S3, email messages and XML files.
A classic case in an ELT approach is loading a CSV file, as-is, into a database table. In that case all we need to do is define the CSV file as the source and a database table as the target. Spoon can detect field names and try to create the mapping between source and target automatically, or you can do it yourself. Furthermore, if the target table does not exist, Spoon can generate a SQL script for you to run and create it.
Transformations and Jobs
In other cases our data flow may not be that simple, and you may want to transform the data before loading it. The product includes many transformation steps for this. For example, you can sort your data using the “Sort rows” step. You may want to split a source field into several fields, for instance splitting a date field into day, month and year fields; you can do that using the “Split Fields” step. Another example is adding a calculated field, using the “Calculator” step. Here are a few of the available steps:
When you build a transformation, you create hops, which connect your sources, targets and steps. Hops are the arrows in the picture below:
In this example you can see a transformation that takes two CSV files, one holding incoming flight records and the other outgoing flights. Both files are sorted, using the Sort rows step, and merged together using a Merge join step. Then the merged data is loaded into a database table.
In addition to transformations, Spoon also allows you to create jobs, which are control flows. You can use jobs to set up one transformation to run after another, and also to do some preparations and checks. For instance, you can have a copy-file step in a job, followed by a transformation execution. Here’s a job example:
In this example, we start the job by running a short JavaScript that builds a file name from the current date, in order to determine which file to look for. Then the job runs a transformation that loads the file into a mirror table. The next step is a transformation that merges data from the mirror table into a fact table. When this is done, the last step zips the file.
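To make that first step concrete, the logic simply derives the expected file name from the current date. Here is the same idea as a shell sketch (in the actual job it runs as a JavaScript step inside PDI, and the naming pattern here is made up):

    # Build today's expected file name, e.g. flights_2016-05-31.csv
    # (hypothetical naming pattern)
    FILE="flights_$(date +%Y-%m-%d).csv"
    echo "Looking for $FILE"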
Pan and Kitchen
While you can create and run transformations and jobs in Spoon, eventually you’ll want to automate them and schedule them to run at specific times. The community edition doesn’t include a scheduler, but there’s a simple way around that. You can use Pan to run transformations from the command line, and you can use Kitchen to run jobs from the command line. This means that you can create shell scripts (on Unix/Linux) or CMD scripts (on Windows) which call Pan and Kitchen.
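Here is a minimal sketch of such a shell script (all paths and file names are illustrative – I’m assuming PDI was extracted to /opt/data-integration):

    #!/bin/sh
    # Illustrative location of the extracted PDI folder
    PDI_HOME=/opt/data-integration

    # Run a single transformation with Pan; -level controls logging verbosity
    "$PDI_HOME/pan.sh" -file=/etl/load_flights.ktr -level=Basic

    # Run a job (which can chain several transformations) with Kitchen
    "$PDI_HOME/kitchen.sh" -file=/etl/nightly_load.kjb -level=Basic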
These scripts can then be scheduled using cron (Unix/Linux) or Task Scheduler (Windows).
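For example, a crontab entry like this (the script path is illustrative) would run such a script every night at 2:00 AM:

    # minute hour day-of-month month day-of-week command
    0 2 * * * /etl/nightly_load.sh >> /var/log/etl/nightly_load.log 2>&1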
Additional Reading & Useful Links
The community edition can be downloaded here. No matter which operating system you are running, just download the file and extract it. The extracted folder includes the executables for all operating systems.
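On Linux, for instance, installation boils down to something like this (the archive name varies by version – this one is just an example):

    # Extract the downloaded archive; it unpacks into a data-integration folder
    unzip pdi-ce-8.3.0.0-371.zip -d /opt

    # The folder contains launchers for every OS:
    # spoon.sh/Spoon.bat, pan.sh/Pan.bat, kitchen.sh/Kitchen.bat
    cd /opt/data-integration
    ./spoon.sh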
This link takes you to a very quick mini-course which shows PDI in action. It demonstrates most of what I’ve talked about in this article.
I hope this article gave you an idea about what PDI is. If you have any questions or suggestions, please comment below.
If you enjoyed the content of this article, please consider sharing it with your network on LinkedIn. As an independent consultant, this helps my business. Thank you!