Running modern ETL-Processes with Framework-Based Tools – Part 1

By September 28, 2021Scalefree Newsletter

A big part of every Enterprise Datawarehouse are ETL- or ELT-processes.
In both abbreviations, the letters stand for the same words, only the order in which each process is done changes.
To brush-up on those processes, “E” stands for extraction, “T” for transformation and “L” is for loading.

That said, rather than dive into the benefits of each,  we would like to present a powerful open-source framework to execute the processes instead.

Why use a framework?

Rather than developing individual solutions per source system, using standardized frameworks provides a wide variety of benefits. The main of which we have already mentioned, standardization.
Another benefit, using the same concept for extracting data from different source systems allows your system to become more auditable and reliable.
And when taking into consideration the varied benefits between frameworks, other potential upsides become available as well.

What is Singer?

One such framework, Singer, is an open-source framework for ETL-processes. Of it’s benefits, Singer describes their aim to be “the standard for writing scripts that move data”.
The framework includes the use of standardized definitions for both data extraction and loading scripts, as well as a unified data format for data exchange between a data source and a target.

How does Singer work?

Singer uses two different types of scripts. Taps are used to extract data from various sources and transform this data into the standardized Singer data exchange format. Targets are using this formatted data to load it into multiple different target-systems. 

Here the advantage of open source projects kicks-in. Everyone can develop their own taps and targets in python following the official Singer development guidelines. As soon as your module is ready to use, you can even submit it to the Singer team and when it meets all their requirements, it will be mentioned at their official website. 

This procedure leads to having more than 100 different taps and 10 targets available, which are all checked and maintained regularly. In addition to that, you can find a tap or target for most other systems on public repositories. Although these modules are not submitted by the Singer Team yet, most of them work perfectly and are maintained frequently. And in the worst case of not finding a suitable tap or target, just grab a code template and start coding for yourself!

Internally we did exactly that, when we couldn’t find a salesforce tap that can grab all the metadata we needed. We just took the salesforce tap by meltano and edited it to collect the response of the .describe() API-function.

Check it out here when you are interested!

What are the key benefits of using Singer?

  • A wide variety of supported systems: With over one hundred taps and nearly a dozen targets, Singer supports a wide range of systems and is powered by python scripts that are regularly reviewed and maintained. Adding to the benefit, the entire range is continuing to grow as community-made taps and targets are made available to the public; all of which can be found on popular repositories like github or gitlab. 
  • Decoupling of source systems and databases: When using Singer as your ETL-tool, you first need to identify all source systems from which you want to load data. After setting up the corresponding taps, you’ll  find a target for your desired database system. As you can combine any tap with any target, the connection to your database must only be set up once, necessary only when configuring the target. Your target can now be combined with all of your taps, so a single point of connection is made. This keeps maintenance requirements low and creates a higher transparency of your ETL-processes. 
  • Standardized Data Exchange Format: Singer uses JSON format to exchange data between taps and targets. This standardization allows the whole system to be as flexible as it is.
  • Incremental Loading – the easy way: Singer supports incremental loading of data by storing the state between two invocations. The timestamp of one invocation can be stored inside a state.json file that gets picked up by the next invocation to start extracting data that arrived after this last timestamp. Some datasources, like MySQL, even support CDC, while most other source systems only support inserts and updates but not deletes.
  • Open Source: Of course, the biggest beneficial feature of Singer is that it’s open source. Singer is released under the Apache License 2.0. This not only means that you or your company can use it for free, it also means Singer has a very active community that is constantly developing new features. This aspect makes Singer even more attractive for small organizations with a low software budget.

To put the above knowledge into better context, we will discuss how to start working with Singer in our next Newsletter. So, don’t forget to sign up for our Newsletter if you haven’t already done so and don’t miss your opportunity to level up your ETL-processes.

-by Tim Kirschke (Scalefree)

Get Updates and Support

Please send inquiries and feature requests to [email protected]

For Data Vault training and on-site training inquiries, please contact [email protected] or register at www.scalefree.com.

To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.

Scalefree

Leave a Reply