Seven Misconceptions about Data Quality
Article published in
DM Direct
June 3, 2004 Issue
By Rick Sherman
The narrow definition of data quality is that it's about bad data
- data that is missing or incorrect. A broader definition is that
data quality is achieved when a business uses data that is comprehensive,
consistent, relevant and timely. If you focus only on the narrow
definition, you may be lulled into a false sense of security when, in
fact, your efforts fall short. We will address several more misconceptions
about data quality.
In order to fix a problem you have to recognize you have a problem.
According to recent Gartner research, 25 percent of Fortune 1000
companies are working with poor quality data. The Data Warehousing
Institute (TDWI) estimated that data quality problems cost U.S.
businesses $600 billion each year. Regulatory initiatives such as
Sarbanes-Oxley and Basel II dictate that companies must provide
transparent data. But even with the documented high costs of poor
data quality and the tight regulatory environment, many companies
are turning a blind eye to their data quality problems. Why? Perhaps
it is because of their mistaken belief that bad data is the only
data quality issue they need to worry about.
A corollary to the above: to fix a problem you first have to take
responsibility for it. That's the rub. Taking responsibility is
the biggest roadblock to dealing with data quality. In order to
achieve a high level of quality, data has to be viewed from an enterprise
and holistic perspective. Data may be correct within each data silo,
but the information will not be consistent, relevant or timely when
viewed across the enterprise. To make matters worse, you've got
each report or analysis interpreting the data differently, so even
when the numbers start off the same in each silo, the end results
will not be consistent. Data is a corporate asset and has to be
consistent across the entire corporation, not just within the business
function or division where it originated.
Misconception #1: You Can Fix Data
Fixing implies that there was something wrong with the original
data, and you can fix it once and be done with it. In reality, the
problem may have been not with the data itself, but rather in the
way it was used. When you manage data you manage data quality. It's
an ongoing process. Data cleansing is not the answer to data quality
issues. Yes, data cleansing does address some important data quality
problems and offers a solid business ROI, but it is only
one element of the data quality puzzle. Too often the business purchases
a data cleansing tool and thinks the problem is solved. In other
cases, because the cost of data cleansing tools is high, a business
may decide that it is too expensive for them to deal with the problem.
Misconception #2: Data Quality is an IT Problem
Data quality is a company problem that costs a business in many
ways. Although IT can help address the problem of data quality,
the business has to own the data and the business processes that
create or use it. The business has to define the metrics for data
quality - its completeness, consistency, relevancy and timeliness.
The business has to determine the threshold at which further investment
in data quality stops paying for itself. IT can enable the processes and manage data through technology,
but the business has to define it. For an enterprise-wide data quality
effort to be initiated and successful on an ongoing basis, it needs
to be truly a joint business and IT effort.
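As an illustration of that division of labor, here is a minimal Python sketch in which the business supplies the definitions (which fields are required, how old a record may be, what score is acceptable) and IT supplies the measurement. The record layout, field names and threshold values are all hypothetical, chosen only for the example, and only two of the four metrics are shown.

```python
from datetime import date

# Hypothetical customer records; the field names are illustrative only.
RECORDS = [
    {"customer_id": "C001", "email": "a@example.com", "updated": date(2004, 5, 30)},
    {"customer_id": "C002", "email": None, "updated": date(2004, 1, 15)},
    {"customer_id": None, "email": "c@example.com", "updated": date(2004, 5, 1)},
    {"customer_id": "C004", "email": "d@example.com", "updated": date(2003, 11, 2)},
]

# Assumption: the business has declared these fields mandatory.
REQUIRED_FIELDS = ("customer_id", "email")


def completeness(records, required=REQUIRED_FIELDS):
    """Fraction of records in which every required field is populated."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(field) not in (None, "") for field in required) for r in records
    )
    return complete / len(records)


def timeliness(records, as_of, max_age_days):
    """Fraction of records updated within max_age_days of the as_of date."""
    if not records:
        return 0.0
    fresh = sum((as_of - r["updated"]).days <= max_age_days for r in records)
    return fresh / len(records)


def failing_metrics(scores, thresholds):
    """Return the metrics whose score is below the business-set threshold."""
    return {name: score for name, score in scores.items() if score < thresholds[name]}


scores = {
    "completeness": completeness(RECORDS),
    "timeliness": timeliness(RECORDS, as_of=date(2004, 6, 3), max_age_days=90),
}
# Hypothetical thresholds; in practice the business sets these, not IT.
print(failing_metrics(scores, {"completeness": 0.95, "timeliness": 0.40}))
```

The code is the easy part. Deciding what counts as required, fresh enough or good enough is the business decision the measurement depends on.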
Misconception #3: The Problem is in the Data Sources or
Data Entry
Data entry or operational systems are often blamed for data quality
problems. Although incorrectly entered or missing data is a problem,
it is far from the only data quality problem. Also, everyone blames
their data quality problems on the systems that they sourced the
data from. Although some of that may be true, a large part of the
data quality issue is the consistency, relevancy and timeliness
of the data. If two divisions are using different customer identifiers
or product numbers, does it mean that one of them has the wrong
numbers or is the problem one of consistency between the divisions?
If the problem is consistency, then it is an enterprise
issue, not a divisional issue. The long-term solution may
be for all divisions to use the same codes, but that has to be an
enterprise decision.
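One common interim step, short of making every division adopt the same codes, is an enterprise-level cross-reference that maps each division's local identifier to a shared key. The sketch below assumes such a mapping exists; the division names, identifiers and the `XREF` table are hypothetical. Neither division's numbers are "wrong" here. The check only reports where the two divisions agree, where they differ, and which identifiers nobody has mapped yet.

```python
# Hypothetical cross-reference agreed at the enterprise level: it maps each
# division's local customer identifier to one shared enterprise key.
XREF = {
    ("sales", "S-1001"): "ENT-1",
    ("sales", "S-1002"): "ENT-2",
    ("service", "77"): "ENT-1",
    ("service", "78"): "ENT-3",
}


def to_enterprise_keys(division, local_ids, xref=XREF):
    """Split local ids into mapped enterprise keys and unmapped local ids."""
    mapped, unmapped = set(), set()
    for local_id in local_ids:
        key = xref.get((division, local_id))
        if key is None:
            unmapped.add(local_id)
        else:
            mapped.add(key)
    return mapped, unmapped


def compare_divisions(a, a_ids, b, b_ids, xref=XREF):
    """Report which customers two divisions share, expressed in enterprise keys."""
    a_keys, a_unmapped = to_enterprise_keys(a, a_ids, xref)
    b_keys, b_unmapped = to_enterprise_keys(b, b_ids, xref)
    return {
        "shared": a_keys & b_keys,
        "only_" + a: a_keys - b_keys,
        "only_" + b: b_keys - a_keys,
        "unmapped": {a: a_unmapped, b: b_unmapped},
    }


report = compare_divisions(
    "sales", ["S-1001", "S-1002", "S-9999"],
    "service", ["77", "78"],
)
print(report)
```

The unmapped identifiers are the ones that need an owner. Someone at the enterprise level has to decide which customer they refer to.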
The larger issue is that you need to manage data from its creation
all the way to information consumption. You need to be able to trace
its flow from data entry, transactional systems, data warehouse,
data marts and cubes all the way to the report or spreadsheet used
for the business analysis. Data quality requires tracking, checking
and monitoring data throughout the entire information ecosystem.
To make this happen you need data responsibility (people), data
metrics (processes) and meta data management (technology). (We'll
address how in a future column.)
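In the meantime, one simple form of checking data along its flow is a control-total reconciliation: sum the same measure at each stage and flag the hop where the totals stop agreeing. The Python sketch below assumes each stage can be read as a list of rows carrying a comparable amount; the stage names, the `amount` field and the tolerance are hypothetical.

```python
def reconcile(stages, amount_field="amount", tolerance=0.01):
    """Compare control totals between consecutive stages of a data flow.

    stages is an ordered list of (stage_name, rows) pairs, where each row is
    a dict holding a numeric amount_field. Returns one tuple per mismatch:
    (from_stage, to_stage, from_total, to_total).
    """
    issues = []
    for (prev_name, prev_rows), (name, rows) in zip(stages, stages[1:]):
        prev_total = sum(r[amount_field] for r in prev_rows)
        total = sum(r[amount_field] for r in rows)
        if abs(prev_total - total) > tolerance:
            issues.append((prev_name, name, prev_total, total))
    return issues


# Hypothetical flow: the data mart has silently dropped one order.
flow = [
    ("transactional", [{"amount": 100.0}, {"amount": 50.0}]),
    ("warehouse", [{"amount": 100.0}, {"amount": 50.0}]),
    ("data_mart", [{"amount": 100.0}]),
]
for from_stage, to_stage, before, after in reconcile(flow):
    print(f"{from_stage} -> {to_stage}: total changed from {before} to {after}")
```

A check like this only shows where a discrepancy entered the flow. Deciding whether it is an error or an intended business rule still falls to whoever owns that data.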
Misconception #4: The Data Warehouse will Provide a Single
Version of the Truth
In an ideal world, every report or analysis performed by the business
exclusively uses data sourced from the data warehouse - data that
has gone through data cleansing and quality processes and includes
consistent interpretations such as profit or sales calculations. If
everyone uses the data warehouse's data exclusively and it meets
your data quality metrics then it is the single version of the truth.
However, two significant conditions lessen the likelihood that
the data warehouse solves your data quality issues by itself. First,
people get data for their reports and analysis from a variety of
data sources - data warehouse (sometimes there are multiple data
warehouses in an enterprise), data marts and cubes (that you hope
were sourced from the data warehouse). They also get data from systems
such as ERP, CRM, and budgeting and planning systems that may be
sourced into the data warehouse themselves. In these cases, ensuring
data quality in the data warehouse alone is not enough. Multiple
data silos mean multiple versions of the truth and multiple interpretations
of the truth. Data quality has to be addressed across these data
silos, not just in the data warehouse.
Second, data quality involves the source data and its transformation
into information. That means that even if every report and analysis
gets data from the same data warehouse, if the business transformations
and interpretations in these reports are different then there still
are significant data quality issues. Data quality processes need
to involve data creation; the staging of data in data warehouses,
data marts, cubes and data shadow systems; and information consumption
in the form of reports and business analysis. Applying data quality
to the data itself and not its usage as information is not sufficient.
Misconception #5: The ERP System will Provide a Single
Version of the Truth
Ditto what I said for Misconception #4.
Misconception #6: The Corporate Performance Management
(CPM) System will Provide a Single Version of the Truth
Ditto what I said for Misconception #4.
Misconception #7: BI Standardization will Eliminate the
Problem of Different "Truths" Represented in the Reports
or Analysis
Yes, standardizing on BI tools can save money and may be a worthwhile
project. But don't lose sight of the fact that the use of different
BI tools is a symptom of a data quality problem, not the cause.
If you pull the same data and implement the same transformations
(formulas) in different BI tools you get the same results. The report,
chart or dashboard may look a little different, but the numbers
would be the same. The problem, therefore, is not that different
BI tools are being used, but that each project implementing these
tools built a different data mart or cube and then applied different
formulas in their reports or analysis. Using the same BI tool in
different projects that use different data with different transformations
is still going to yield different results - and hence the data quality
issues still remain. The cause of the data quality issues is the
lack of consistency in the data used and the transformations applied to it,
not the use of different BI tools.
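A small Python sketch makes the point concrete. Both functions below read exactly the same orders, yet they report different "profit" figures because each project encoded its own formula. The order data, the field names and the two definitions of profit are hypothetical.

```python
# Hypothetical order data shared by both reports.
ORDERS = [
    {"revenue": 1000.0, "cost_of_goods": 600.0, "shipping": 50.0, "returned": False},
    {"revenue": 500.0, "cost_of_goods": 300.0, "shipping": 25.0, "returned": True},
]


def profit_finance(orders):
    """One project's formula: skip returned orders and subtract shipping."""
    return sum(
        o["revenue"] - o["cost_of_goods"] - o["shipping"]
        for o in orders
        if not o["returned"]
    )


def profit_sales(orders):
    """Another project's formula: count every order and ignore shipping."""
    return sum(o["revenue"] - o["cost_of_goods"] for o in orders)


print("Finance profit:", profit_finance(ORDERS))
print("Sales profit:", profit_sales(ORDERS))
```

Running both formulas in the same BI tool would not reconcile the two numbers. Agreeing on a single definition of profit would.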
Data quality is defined as comprehensive, consistent, relevant
and timely data for use by the business. Don't shrug it off as an issue
of bad data entry. Data quality needs to be addressed on an enterprise scale
and in a holistic manner incorporating people, processes and technology.
About Athena IT Solutions
Athena IT Solutions provides data
warehousing and business intelligence consulting services to
help businesses increase the return on investment of their corporate
data. Athena IT Solutions founder Rick Sherman has more than 17
years of business intelligence and data warehousing experience,
having worked on more than 50 implementations as an independent
consultant and as a director/practice leader at a Big Five firm.
He founded Athena IT Solutions, a Boston-based business intelligence
and data warehousing consulting firm and is a published author,
industry speaker, instructor and consultant. He can be reached at
rsherman@athena-solutions.com
or (617) 835-0546.