The Trial-and-Error Method for Data Integration
Originally published in Information Management on February 2007.
In The Trial-and-Error Method for Data Architecture I discussed some common data architecture mistakes. In this article, I point out areas where companies tend to go wrong in data integration or extract, transform and load (ETL) processing.
1. Not developing an overall architecture and workflow.
The usual development approach for data integration is to gather the data requirements, determine what data is needed from source systems, create the target databases such as a data warehouse and then code. This is an incomplete, bottom-up approach. It needs to be coupled with a top-down approach that emphasizes an overall data integration architecture and workflow.
Some of the design considerations often lost with a bottom-up approach include:
- How and where do you implement a refresh (replacing all your data) rather than a change data capture (CDC) approach?
- How do you process dimensional data, such as products and customers, in relation to facts, such as business transactions?
- When do you filter and aggregate data?
When do you use staging tables, and are they persistent or transient?
- How do you handle data quality?
- Do you reject, suspend or flag data with perceived data quality problems?
- How do you handle the changes and gaps in historical data?
Failing to address these considerations in the beginning can delay your project, increase costs, reduce perceived data quality and cause business users to question the value of your business intelligence (BI)/data warehousing (DW) efforts.
2. Thinking that data quality is a product rather than a process.
People often assume that data quality problems are simply data errors or inconsistencies in the transactional systems that can be fixed with data quality products. They overlook and, therefore, don't try to prevent the fact that problems arise when you integrate data from disparate source systems into a data warehouse.
They're not seeing that many data quality problems are really data consistency and integrity issues that arise when you integrate data from multiple transaction systems. The differences in dimensional data, such as product (part IDs, code and hierarchy), customers (business and/or people), suppliers, partners and employees become an issue on the enterprise level, which is precisely the target for DW and performance management reporting and analysis.
Even when data quality problems are not your fault, you still need to take responsibility to proactively measure, monitor and report on data quality metrics as you load your DW and data marts. You might not own fixing the problems, but you certainly own measuring them. The business should not make decisions using enterprise data of unknown quality.
Measure data quality by obtaining data quality requirements in your initial business requirements phase, incorporating data quality metrics into your architecture, monitoring those metrics in all your data integration processes and reporting on data quality so the business users understand the data on which they are basing their decisions.
3. Assuming custom coding is faster than ETL development.
While most large enterprises have embraced ETL development as a best practice, the reality is that custom coding is still prevalent, especially in the following areas:
- The small-to-medium business market is not using ETL tools as extensively as their larger brethren.
- Companies that have used database vendors' ETL tools (particularly older versions of these tools, such as Microsoft DTS and Oracle Warehouse Builder) may be using them just to run SQL scripts or store procedures. Although they are technically using ETL tools, in reality they are writing custom code and just using the ETL tool to run their code.
- Many BI applications need to build summary or aggregation tables (these should be data marts, but many times they are not). These reporting tables are often built using SQL scripts or stored procedures. The BI developers know SQL, feel comfortable in creating tables with it and do not feel the need to use an ETL tool.
- Business groups are using custom coding to create countless data shadow systems in large and small enterprises (defined in my April 2004 column). They may have asked for a flat file extract from IT, or they used a BI tool to create their own flat file extract. They then use a combination of Microsoft Access and Microsoft Excel to perform their ETL work.
In all the cases just mentioned, I get the usual feedback. "Rick, how can it be faster to code with the ETL tool, and how can the ETL code perform as well as the code I create?" Without debating custom coding versus the merits of ETL development, let's just say that when I see a large block of custom code, I see a great opportunity for change and improvement. But custom coding is often below the radar because most just assume that everyone is using ETL tools.
In general, people tend to take an oversimplified view of data integration. By not seeing it as a process to be managed from a top-down perspective, they tend to run into problems and learn the hard way - through trial and error.