Beyond ETL and Data Warehousing
Originally published in Information
Management on February19, 2009.
Data integration suffers from an image problem. It has become synonymous
with extract, transform and load. Likewise, ETL has been regarded as a
data warehousing technology. Both of these viewpoints fail to reflect
current capabilities, and they greatly inhibit enterprises in their attempt
to integrate data to provide the information their business needs. Because
of this short-sightedness, companies have lost opportunities to harness
information as a corporate asset. It increases the cost of integrating
data, encourages the creation of data silos and forces businesspeople
to spend an inordinate amount of time filling the information gaps themselves
through data shadow systems or reconciling data.

More than Simply ETL Tasks
The basic tasks required in data integration are to gather data, transform
it and put it into its target location. If that sounds like ETL, that’s
because that’s exactly what it is (see Figure 1). ETL tools have automated
these tasks and empowered developers with a toolkit beyond what they could
have easily hand coded themselves. ETL tools, for example, include prebuilt
transformations from the elementary tasks of converting data and performing
lookups to the more complex processes of change data capture and slowly
changing dimensions. These prebuilt transformations greatly enhance developer
productivity and improve the consistency of results. Data integration
tools offer many significant processes and technologies that extend beyond
the basic ETL tasks (see Figure 2). These extensions are necessary to
turn data into comprehensive, consistent, clean and current information.
The extended processes include data profiling, data quality and operational
processing. This creates the ability to determine the state of the source
systems, perform cleansing, ensure consistency and manage all the processing,
including error handling and performance monitoring.

Data integration suites have expanded to incorporate enterprise application
integration, enterprise information integration and service-oriented architecture
coupled with ETL to offer data integration in batch, interoperating with
applications, or in real time from business intelligence applications.
More than Batch Processes
There are many integration initiatives in an enterprise. With data integration
being associated with batch-driven ETL processes that load a data warehouse
(DW), integration initiatives that did not involve a DW sought out other
technologies. These technologies included EAI, EII and SOA. Although each
of these technologies has fundamental applications, the reality is that
organizations had to reinvent the wheel for every data integration task.
The results have been integration silos built with different technologies,
producing inconsistent business information and generally with data integration
built as an afterthought.
The good news is that data integration vendors that now combine all
of the above technologies into data integration suites have emerged from
the ETL ranks. These suites enable an enterprise to integrate data in
one consistent manner, yet deploy using whatever transport technology
(i.e., ETL, EAI, EII or SOA) is appropriate.
More than Data Warehousing
With the emergence of the more powerful suites, data integration has
moved beyond data warehousing to include other integration initiatives
in an enterprise, such as:
- Data migration,
- Application consolidation,
- Operational and real-time BI and
- Master data management, customer data integration and product information
management.
Companies often undertake data migration or application consolidation
projects because of mergers and acquisition or because they need to streamline
applications. In the past, these projects were seen as one-offs and typically
hand coded. As systems integrators became proficient in ETL tools from
DW projects, they realized that they would be much more productive at
data migrations and application consolidation projects if they used these
same data integration tools. Even though they are one-offs, data integration
tools enabled the SIs to reuse code, leverage prebuilt transformations,
better manage processes and produce documentation without a laborious
manual effort. In addition, they did not have to deploy a cadre of coding
gurus but could leverage the data integration developers they already
employed.
Several market forces have converged to produce the perfect storm, enabling
operational or real-time BI with the same data integration and BI tools
as used in DW projects. These forces include enterprise applications built
on relational databases and data integration tools no longer bound to
batch ETL constraints. In addition, with the major enterprise application
vendors also offering data integration and BI tools, these vendors are
bundling this convergence is more consistent, comprehensive and current
information (business benefit) with the same data integration and BI infrastructure
(IT benefit).
MDM, CDI and PIM all deal with conforming and maintaining master data
or reference data for data subjects, such as customers and products. The
initial wave of technology solutions bundled a set of tools and applications
that were business process or industry specific. What got lost in many
of the initial implementations was that these applications relied heavily
on data integration and that it made sense to leverage a company’s
existing data integration platform to create MDM, CDI and PIM solutions.
Moving to Pervasive Data Integration
Data integration tools aren’t pervasive yet, despite the fact that
their use is a best practice. The primary inhibitors have been cost and
resources, lack of understanding of the tool capabilities and a market
unawareness of tool offerings.
Although Fortune 1000-size corporations tend to use these tools to build
their data warehouses, they still hand code their data marts, online analytical
processing cubes and other reporting databases. The barriers to pervasive
use in these enterprises include:
- Expense. The licensing costs of these tools often inhibit
more widespread use.
- Resources. Often, the data integration developers are
tied up in DW development and are not available for other database loading.
- Fit. The data integration tool selected as the corporate
standard may not be the best match for creating the data marts, cubes
and reporting databases required by individual groups.
The issue of expense is best addressed when selecting a tool or negotiating
pricing with the tool vendor. Licensing cost should not be the barrier
it once was with many vendors offering more scalable pricing options.
Companies can address the issue of resources by forming a data integration
competency center. Identifying and elevating the visibility of the extent
of the data integration tasks occurring throughout an enterprise will
enable IT to justify either more resources or a prioritization of all
integration projects so that the enterprise DW does not consume all available
resources.
The issue of fit is political for large enterprises because the solution
is to go against the grain of selecting one corporate standard. The pragmatic
solution would be to create two corporate standards: one for enterprise-class
data integration and the other for “downstream” databases
such as data marts or cubes. The downstream databases do not require the
many sources, data cleansing or data conforming that data warehousing
requires. Most of these downstream databases could be loaded by more cost-
and resource-effective ETL solutions than required for the enterprise.
Not only do corporate IT departments balk at this solution, but tool
vendors often think they should “own” the account and can
do everything that every other tool can do. The pragmatic answer I give
both groups is that without the two-tier standard approach, the downstream
databases will continue to be hand coded with all the business and IT
costs and risks associated with those applications. In addition, more
data shadow systems will arise to plug the information gap because hand
coded applications will take longer and incur more maintenance costs as
they age.
For firms smaller than the Fortune 1000, such as those with annual revenue
between $250 million and $1.5 billion, the inhibitors to using data integration
tools are costs, resources and being unaware of the breadth of the data
integration market. These firms do not have the IT budget or resources
of the Fortune 1000 to dedicate to data integration solutions. Although
this means they would spend less on solutions, it should not mean they
are forced to hand code.
However, they might not be aware of the tools – at least those
that are in their budget. If they are aware of any data integration tools
at all, it is most likely that the high-end tools are expensive and require
highly skilled resources. They are aware of these tools because industry
analysts and publications seem to only mention the expensive ones. They
also hear about them from employees who used to be employed by Fortune
1000 firms. From their perspective, you either have to pay for high-end
tools or you hand code, and hand coding usually wins out.
The data integration tool market includes products supporting a wide
range of capabilities, skills and costs. As discussed, there may be multiple
products available that would be an excellent fit for a midmarket firm’s
needs, skills and budget.
Next Steps
If you work in a Fortune 1000 company, ask yourself if you’re stuck
in the ETL rut and if business groups throughout your enterprise are hand
coding or, worse, building data shadow systems. If your company is smaller
than the Fortune 1000, determine if you are hand coding your data integration
processes and why.
After you get your answers, become an advocate to make data integration
more pervasive in your enterprise and unleash the information your business
needs.
Data Profiling
Data profiling is the examination and assessment of your source systems’
data quality, integrity and consistency. Sometimes called source systems
analysis, it helps ensure your data integration works properly so you
can avoid being surprised by bad data.
Data profiling tools, which automate the process, assist your staff in
truly understanding not just the data definitions but, more importantly,
the real content of the data. Without these tools, data profiling requires
writing hand-coded SQL statements and comparing query results with expected
results. Not only is this laborious and time-consuming, it is likely to
be incomplete because people generally will not hand code every permutation
of a source system’s tables, views and columns. Data warehousing
and BI projects are often late or encounter data surprises when the staff
hand codes data profiling. It’s even worse when, overwhelmed, people
skip it completely.
Data profiling should be established as a best practice for every data
warehouse, BI, and data migration project. In addition to meeting project
requirements, data profiling should be an ongoing activity to ensure that
you maintain data quality levels.
|