At its most basic, the ETL process encompasses data extraction, transformation, and loading. While the acronym implies a neat, three-step process – extract, transform, load – this simple definition doesn’t capture:
- The transportation of data
- The overlap between each of these stages
- How new technologies are changing this flow (e.g. ELT)
It’s better to think of ETL in terms of the business value it provides: ETL is the process by which data is extracted from data sources that are not optimized for analytics and moved into a central host, where it is optimized for analytics. The exact steps in that process might change, but the end result is the same.
Traditional ETL Process
Historically, the process has looked very much like this:
Data is extracted from Online Transactional Processing (OLTP) databases, today more commonly known simply as transactional databases, and from other data sources. The data is then transformed in a staging area; these transformations cover both data cleansing and optimizing the data for analysis. Finally, the transformed data is loaded into an Online Analytical Processing (OLAP) database, today more commonly known simply as an analytics database, in the form of OLAP summaries.
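The traditional flow above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: SQLite stands in for both the OLTP source and the OLAP target, and the `orders` table and its columns are hypothetical.

```python
import sqlite3

# Hypothetical stand-ins: SQLite plays both the OLTP source and the OLAP target.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 10.0, "paid"), (2, 5.5, "paid"), (3, 99.0, "refunded")])

# Extract: pull raw rows out of the transactional database.
rows = source.execute("SELECT id, amount, status FROM orders").fetchall()

# Transform (staging area): cleanse and pre-aggregate into an OLAP summary.
paid_total = sum(amount for _id, amount, status in rows if status == "paid")

# Load: write only the summary into the analytics database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE revenue_summary (total REAL)")
warehouse.execute("INSERT INTO revenue_summary VALUES (?)", (paid_total,))
warehouse.commit()

print(warehouse.execute("SELECT total FROM revenue_summary").fetchone()[0])  # 15.5
```

Note that only the pre-aggregated summary reaches the warehouse; the raw rows are discarded, which is exactly why an unsupported analysis forces a rerun with different transformations.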
Business intelligence (BI) teams then ask the data warehouse team to run queries on that data, and the results are eventually presented to the end user: the individual responsible for making business decisions based on that data. A common problem arises when the OLAP summaries can’t support the type of analysis the BI team wants to do; in that case, the whole process needs to run again with different transformations.
Modern ETL Process
Modern technology has changed the way that ETL is approached. This change has taken place for several reasons. The biggest one though is the advent of powerful analytics warehouses like Amazon Redshift and Google BigQuery.
These newer technologies mean that analytics databases can now perform transformations in place, rather than requiring a special staging area. Data is also frequently analyzed in raw form rather than being pre-loaded into OLAP summaries. This has led to the development of ETL systems that are more lightweight, flexible, and transparent. They look something like this:
The biggest advantage to this setup is that transformations and data modeling happen in the analytic database, in SQL. This gives the BI team, data scientists, and analysts interacting with the data far greater control over how they work with it, in the common language they all understand.
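A minimal sketch of that load-then-transform pattern, assuming SQLite as a stand-in for a warehouse like Redshift or BigQuery, with hypothetical `raw_orders` and `orders_model` tables:

```python
import sqlite3

# Hypothetical warehouse (SQLite standing in for Redshift/BigQuery).
wh = sqlite3.connect(":memory:")

# Load first: raw data lands in the warehouse untransformed.
wh.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, 10.0, "paid"), (2, 5.5, "paid"), (3, 99.0, "refunded")])

# Transform in place, in SQL: the model the BI team queries is just another table.
wh.execute("""
    CREATE TABLE orders_model AS
    SELECT status, COUNT(*) AS n_orders, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY status
""")

for row in wh.execute("SELECT * FROM orders_model ORDER BY status"):
    print(row)
```

Because the raw rows stay in the warehouse, a new question from the BI team means writing a new SQL model, not rerunning the whole pipeline.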
Critical ETL Components
Regardless of the exact ETL process you choose, there are some critical components you’ll want to consider:
- Support for Change Data Capture (CDC) (aka binlog replication): Incremental loading allows you to update your analytics warehouse with new data without doing a full reload of the data. We’ll say more about this in the ETL Load section.
- Auditing and Logging: It’s important that detailed logging happens within the ETL pipeline, so that data can be audited after it is loaded and errors can be debugged.
- Handling of multiple source formats: You’ll likely want to pull in data from sources like Salesforce’s API and your backend financial systems, as well as databases like MySQL and MongoDB. Your process needs to be compatible with data that shows up in a variety of formats.
- Fault Tolerance: Depending on the type of system you’re using, this may take different forms, but all ETL systems must be fault tolerant. These are production systems, so problems will inevitably occur. Systems need to recover gracefully, ensuring that data can still make it from one end of the pipeline to the other even when the first run encounters problems.
- Notification support: if you want your organization to trust its analyses, you have to build in notification systems to alert you when data isn’t accurate. These might include:
- Proactive notification directly to end-users when API credentials expire
- If a third-party API reports an issue, we retry the query with exponential backoff before giving up and marking it as an error with them
- If there’s an unexpected error in a connector, we abort the job and automatically create a ticket to have an engineer look into it
- Utilizing systems-level monitoring for things like errors in networking or databases so that we’re automatically alerted to system-wide disruptions
- Low latency: Some decisions need to be made in real time, so data freshness is critical. While there will be latency constraints imposed by particular source data integrations, data should flow through your ETL process with as little latency as possible.
- Scalability: As your company grows, so will your data volume. All components of an ETL process should scale to support arbitrarily large throughput.
- Accuracy: Data cannot be dropped or changed in a way that corrupts its meaning. Every data point should be auditable at every stage in your process.
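The incremental-loading idea behind CDC can be sketched with a high-water mark. This is a simplification: true binlog replication tails the database’s change log, whereas the hypothetical `incremental_load` function below just queries for rows with IDs above the highest ID already loaded, again using SQLite as a stand-in for both systems.

```python
import sqlite3

# Hypothetical source and warehouse; SQLite stands in for both.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?)",
                   [(1, "a"), (2, "b"), (3, "c")])

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

def incremental_load():
    """Copy only rows newer than the warehouse's high-water mark."""
    high_water = warehouse.execute(
        "SELECT COALESCE(MAX(id), 0) FROM events").fetchone()[0]
    new_rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ?", (high_water,)).fetchall()
    warehouse.executemany("INSERT INTO events VALUES (?, ?)", new_rows)
    warehouse.commit()
    return len(new_rows)

print(incremental_load())  # first run copies all 3 rows
source.executemany("INSERT INTO events VALUES (?, ?)", [(4, "d"), (5, "e")])
print(incremental_load())  # second run copies only the 2 new rows
```

Each run touches only new rows, so the warehouse stays current without the full reloads that make pipelines slow and fragile.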
Keep Learning about the ETL Process
- Engineers Shouldn’t Write ETL – “In case you did not realize it, nobody enjoys writing and maintaining data pipelines or ETL. It’s the industry’s ultimate hot potato,” writes Jeff Magnusson in this excellent write-up on how to structure data science teams.
- ELT vs ETL – This whitepaper explains the benefits of performing transformations after the data is loaded into a data warehouse.
- Overview of Extraction, Transformation, and Loading – This explanation from Oracle is particularly interesting because it explains why the term ETL is too simplistic and how it misleadingly suggests that each phase is distinct. Depending on the source system’s capabilities, the chosen transport method, and so on, transformations can happen many times throughout the ETL process.
- The Data Infrastructure Meta-Analysis – An analysis of how today’s top tech companies are building their ETL workflows, from data extraction to insights.