You’ve heard it before: tech projects have a tendency to go over time and over budget. Specifically, 45% over budget, 7% over time, and 56% less value delivered than predicted. Your ETL project will probably not be immune. Here are the most common mistakes that derail ETL projects.
Forgetting about long-term maintenance
Some of the future maintenance costs that companies forget about are:
- Data formats changing over time
- An increase in data velocity
- The time cost of adding new data connections
- The time cost of fixing broken data connections
- Requests for new features, including new columns, dimensions, and derivatives
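Schema drift in particular is easy to underestimate. As a rough sketch (the field names and records here are hypothetical, not from any particular system), one defensive approach is to normalize incoming records so that fields appearing or disappearing upstream don't break the whole load:

```python
# Defensive record normalization: tolerate fields appearing or disappearing
# over time instead of failing the entire load. Illustrative sketch only;
# EXPECTED_FIELDS and the sample rows are hypothetical.

EXPECTED_FIELDS = {"id": int, "email": str, "signup_date": str}

def normalize_record(raw: dict) -> dict:
    """Keep only expected fields, coerce types, default missing ones to None."""
    clean = {}
    for field, ftype in EXPECTED_FIELDS.items():
        value = raw.get(field)
        if value is None:
            clean[field] = None          # field missing upstream: default, don't crash
            continue
        try:
            clean[field] = ftype(value)  # coerce, e.g. "42" -> 42
        except (TypeError, ValueError):
            clean[field] = None          # unparseable: null it out for later review
    return clean

# One record with an unexpected new field, one with a missing field
rows = [
    {"id": "1", "email": "a@x.com", "signup_date": "2021-01-05", "new_col": "?"},
    {"id": 2, "signup_date": "2021-02-10"},
]
cleaned = [normalize_record(r) for r in rows]
```

In practice you would also want to log or quarantine records that fail coercion rather than silently nulling them, but the point stands: format changes are a maintenance cost you pay for the life of the pipeline.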
ETL best practice #1: Make the ETL buy vs. build decision with care. ETL can be a fun challenge at the beginning, but it gets old fast. Jeff Magnusson takes a strong stance on this topic: “Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.” Even if your developers are excited about the challenge, choose the build approach with caution.
Underestimating data transformation requirements
Raw data from disparate sources almost always needs to be cleaned and normalized before it reaches your data warehouse in order to make sense. Data from different systems typically doesn’t play well together, and it takes work to get it to cooperate. Here are some common examples:
- Establishing key relationships across data sources, many of which might not exist in the raw data
- Updating new values on existing records without sacrificing performance
- Time zone consistency
- Lookup value normalization (US = United States = USA)
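The last two items above can be sketched in a few lines. This is a minimal illustration, not a production cleansing layer; the alias mapping and the source time zone are hypothetical examples:

```python
from datetime import datetime, timezone, timedelta

# Lookup-value normalization: map source-specific spellings to one canonical
# value (US = United States = USA). The alias table is a hypothetical example.
COUNTRY_ALIASES = {"us": "US", "usa": "US", "united states": "US"}

def normalize_country(value: str) -> str:
    """Return the canonical spelling, or the trimmed input if no alias matches."""
    return COUNTRY_ALIASES.get(value.strip().lower(), value.strip())

# Time zone consistency: convert every timestamp to UTC before loading,
# and refuse naive timestamps whose source time zone is unknown.
def to_utc(ts: datetime) -> datetime:
    if ts.tzinfo is None:
        raise ValueError("naive timestamp: source time zone unknown")
    return ts.astimezone(timezone.utc)

eastern = timezone(timedelta(hours=-5))           # a source system's local zone
local = datetime(2021, 3, 1, 9, 30, tzinfo=eastern)
```

Here `normalize_country("United States")` and `normalize_country("USA")` both yield `"US"`, and `to_utc(local)` lands every timestamp in one consistent zone before it hits the warehouse.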
If you ignore the transformation step, the data in your warehouse will be impossible to work with and full of inconsistencies, causing your decision makers to lose faith in its accuracy.
ETL best practice #2: While most data warehouses today have the processing capabilities to manage data modeling in the warehouse, you’ll still want a cleansing mechanism to get the data to a place where it plays nicely with data from other sources.
Forgoing the customer development process
Before you begin, remember that business units are the customers for this project. Who wants the data you are ETLing? What do they want to do with it? Run through the customer development process: do interviews (and not just with managers) and get your hands on their current analyses. If you don’t have this information, you might be surprised by things like:
- Not bringing in the information your users need most
- Failing to support mission-critical reporting workflow
- Failing to anticipate future data needs
ETL best practice #3: Regard your business units as customers and involve them in the development process. This will help ensure you build the system they need.
Tightly coupling different elements of your data pipeline
Any pipeline that’s responsible for shipping data into a warehouse has more components than you might initially anticipate: data integrations for each raw data source, transformation logic, job scheduling, and basically all the components outlined in the ETL architecture section.
Each of these components and their subcomponents involves independent technical decisions that will impact scalability, functionality, and the maintenance burden down the road. Frequently, first-time data engineers attempt to solve all of these problems with the same technology, but often each component requires a specialized solution.
ETL best practice #4: Choose the right tool for each component of your stack. This allows you to update components of your ETL process when technologies and business needs change, without having to rebuild from scratch.
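One lightweight way to keep components swappable is to wire them together through small interfaces rather than letting one stage reach into another's internals. The sketch below is illustrative (the stage names and sample data are hypothetical, and this is not any particular framework's API):

```python
# Loose coupling via small interfaces: each stage of the pipeline can be
# replaced independently. Illustrative sketch; names and data are hypothetical.
from typing import Callable, Iterable

Record = dict

def run_pipeline(
    extract: Callable[[], Iterable[Record]],
    transform: Callable[[Record], Record],
    load: Callable[[Iterable[Record]], int],
) -> int:
    """Wire independent components together; each is replaceable on its own."""
    return load(transform(r) for r in extract())

# Swappable implementations of each stage:
def extract_from_list():            # could instead be an API client or DB reader
    return [{"amount": "10"}, {"amount": "32"}]

def parse_amount(rec):              # could instead be a transformation library
    return {"amount": int(rec["amount"])}

loaded = []
def load_to_memory(records):        # could instead be a warehouse bulk loader
    loaded.extend(records)
    return len(loaded)

count = run_pipeline(extract_from_list, parse_amount, load_to_memory)
```

Because `run_pipeline` only depends on the three callables' signatures, swapping the in-memory loader for a warehouse bulk loader, or the list extractor for an API client, doesn't touch the other stages.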
Building your ETL process based on your current data scale
The volume and variety of your data are going to increase dramatically. The more successful your company and the more data-reliant your team becomes, the more quickly you’ll go from generating thousands of rows per hour to millions of rows per second.
Is your business about to expand globally? Does your product team want to do analysis on in-app events? Is your business on the cusp of rapid growth? Any of these scenarios could dramatically change the scale of data you’re transporting through your ETL system.
ETL best practice #5: Your ETL system needs to be able to scale; otherwise, you’ll quickly outgrow your infrastructure, and be stuck rebuilding.
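One concrete scaling habit is to stream records through the pipeline in fixed-size batches rather than materializing everything in memory, so memory use stays flat as row counts grow. A minimal sketch (the batch size and sample rows are hypothetical):

```python
# Batched streaming: process records in fixed-size chunks so memory use stays
# flat as volume grows from thousands to millions of rows.
# Illustrative sketch; the batch size and row source are hypothetical.
from itertools import islice
from typing import Iterable, Iterator, List

def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# A generator source: rows are produced lazily, never all held in memory
rows = ({"id": i} for i in range(10))
sizes = [len(b) for b in batches(rows, 4)]   # 10 rows in batches of 4
```

Each batch can then be transformed and bulk-loaded as a unit, which is also how most warehouse loaders prefer to ingest data.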
Not recognizing the warning signs
It can be easy to succumb to the sunk cost fallacy and plow ahead as a project gets more and more off-track. The ETL warning signs to look for are similar to the signs of any failing tech project:
- Are you missing deadlines?
- Are your engineers spending more time supporting existing processes than building new ones?
- How do your engineers feel about the project? How about your end users?
ETL best practice #6: If you’re already deep in the building process and realize you made some mistakes, don’t be afraid to admit it and take a few steps back. You won’t be the first person to get in over their head on an ETL project.
Focusing on tools and technologies rather than fundamental best practices
The data engineering space is incredibly hot right now; new languages and technologies pop up every minute, and there’s a tendency to get swept up in the latest fad. Let’s use Spark! Let’s use Kafka! Everyone is using Hadoop! No matter what technology or language you choose, the fundamentals of ETL remain the same.
ETL best practice #7: Focus on getting the basics right and solving the problems of your own unique organization. Tools should be secondary.