Sidebar: Choosing the Data Quality Techniques Your Company Needs

When it comes to your data, how clean is clean enough?

"No serious information system has a data quality of 100%," data quality guru Ken Orr said in his report "Data Quality and Systems Theory," published in 1998. Nothing has changed in the intervening five years.

"The real concern with data quality is to insure not that the data quality is perfect, but that the quality of the data in our information systems is accurate enough, timely enough and consistent enough for the organization to survive and make reasonable decisions," Orr wrote.

A database restricted in complexity and use might need nothing more than data cleansing, which might correct a misspelling or change "Avenue" to "Street." But cleansing alone won't necessarily recognize the subtle relationships between records needed to build and maintain accurate 360-degree views of customers, parts, locations and events.
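
As a rough illustration of what rule-based cleansing looks like -- not any particular vendor's tool -- a few lookup tables and string rules go a long way. The misspellings and abbreviations below are invented examples.

```python
# Hypothetical correction tables; a real cleansing tool would draw these
# from postal reference files and much larger dictionaries.
SPELLING_FIXES = {"Bostn": "Boston", "Chicgo": "Chicago"}
STREET_TYPES = {"Ave": "Avenue", "Ave.": "Avenue", "St": "Street", "St.": "Street"}

def cleanse_address(raw: str) -> str:
    """Apply simple token-level cleansing rules to one address string."""
    fixed = []
    for token in raw.split():
        token = SPELLING_FIXES.get(token, token)  # correct known misspellings
        token = STREET_TYPES.get(token, token)    # standardize street types
        fixed.append(token)
    return " ".join(fixed)

print(cleanse_address("123 Main St.  Bostn"))  # -> 123 Main Street Boston
```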

As the complexity of the database and the diversity of data sources increase, you need increasingly sophisticated data quality tools.

Extract, transform and load (ETL) tools periodically extract data from source systems, transform it into a common format and then load it into a target database. However, unless the data an ETL tool receives has first been cleansed, it will simply load incorrect data into the target -- a failure mode known as the "code, load and explode" phenomenon.
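
The basic pattern is easy to sketch. The file name, column names and target table below are invented for illustration; a production ETL job would add error handling, scheduling and much richer transformations.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw customer rows from a source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: map source fields onto a common target layout."""
    for row in rows:
        yield (row["cust_id"].strip(),
               row["name"].strip().title(),
               row["phone"].replace("-", ""))

def load(rows, db_path="warehouse.db"):
    """Load: write the transformed rows into the target database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT, phone TEXT)")
    con.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

load(transform(extract("crm_export.csv")))
```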

A common perception is that data quality is a simple matter of data being clean or dirty, says Stephen Brown, executive director of product management and marketing at Ascential Software Inc. in Westboro, Mass. "Data doesn't have a dualistic nature," he says.

Say, for example, a phone number is accurate for one transaction. Then area codes change. The number is still accurate for that transaction, but not for future ones.

For many companies, data cleansing and ETL tools alone don't go far enough. At Monster Worldwide Inc. in Maynard, Mass., they weren't letting the company get its money's worth from its data warehouse.

"Cleansing the company's own data wasn't doing the whole job," says Steve Deeb, Monster.com's vice president for CRM. "To give the sales organization the support they need, it's critical that they have not only clean data but also correct data."

A data quality tool will integrate with demographic and firmographic (data about companies) sources and provide prebuilt gateways to enterprise applications such as CRM or ERP.

Data quality tools owe a conceptual debt to Dr. Howard Newcombe's work on record linkage and to the U.S. Census Bureau, which developed the first probabilistic record-linkage application in the 1960s.

Record linkage is a method of matching records in multiple databases in which similarities and differences between records are computed, encoded and evaluated to determine whether they relate to a common entity, such as an individual, family, event, business or address.

A probabilistic data quality tool audits data by generating statistics about the contents of data fields. Parsing identifies floating fields -- data elements that have been inserted into inappropriate fields -- and places them into the correct fields. Householding identifies which individuals belong to the same household or company. Matching identifies duplicate or related entities.
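
Householding, for instance, amounts to grouping records on a normalized address key. The records and normalization rules below are a deliberately crude, invented illustration; real tools apply far richer address standardization.

```python
from collections import defaultdict

def household_key(record):
    """Build a crude key from normalized address parts (illustrative rules only)."""
    addr = record["address"].lower().replace(".", "").replace("street", "st")
    return (addr, record["zip"])

def household(records):
    """Group individual records that appear to share one household."""
    groups = defaultdict(list)
    for rec in records:
        groups[household_key(rec)].append(rec["name"])
    return dict(groups)

people = [
    {"name": "A. Rivera", "address": "12 Elm Street", "zip": "10001"},
    {"name": "B. Rivera", "address": "12 Elm St.",    "zip": "10001"},
    {"name": "C. Jones",  "address": "99 Oak St",     "zip": "10001"},
]
print(household(people))
# {('12 elm st', '10001'): ['A. Rivera', 'B. Rivera'], ('99 oak st', '10001'): ['C. Jones']}
```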

Common matching algorithms include the following (the Soundex and fuzzy approaches are sketched in code after the list):

  • Key code: Matches the first few characters in a field.
  • Soundex: Matches words based on their pronunciation.
  • Fuzzy: Matches records by the degree to which the algorithm determines they are alike.
  • Weighted: Matches on fields weighted in the order of importance that users assign.
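
Here is a toy version of two of these -- Soundex and a fuzzy similarity ratio -- to show the flavor. It is not any vendor's implementation, and the fuzzy comparison simply uses Python's difflib.

```python
import difflib

SOUNDEX_CODES = {**{c: "1" for c in "bfpv"}, **{c: "2" for c in "cgjkqsxz"},
                 **{c: "3" for c in "dt"}, "l": "4",
                 **{c: "5" for c in "mn"}, "r": "6"}

def soundex(word):
    """American Soundex: keep the first letter, encode following sounds as three digits."""
    word = word.lower()
    result = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":                    # h and w do not break a run of like sounds
            continue
        code = SOUNDEX_CODES.get(ch, "")  # vowels get no code and reset the run
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- pronounced alike
print(difflib.SequenceMatcher(None, "Jon Smith", "John Smith").ratio())  # ~0.95
```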

Ascential's data quality tool has 24 prebuilt matching algorithms, which the probabilistic engine adjusts dynamically in response to changing characteristics in the data, Brown says.

Ascential also adds to its data quality tool set a proprietary parallel-processing strategy, which lets the software harness dozens, or even hundreds, of cheap, fast CPUs and partition a large database many times over so that records can be matched on the fly, Brown says.
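
The general idea of partitioning match work -- splitting records by a cheap key so that only records within a partition are compared, with partitions farmed out to separate CPUs -- is known in the record-linkage literature as blocking. The sketch below shows only that partitioning step, with an invented key; it is not Ascential's engine.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record):
    """Cheap partitioning key: first letter of surname plus ZIP code (illustrative)."""
    return (record["last_name"][:1].upper(), record["zip"])

def candidate_pairs(records):
    """Compare only records that share a block; each block could run on its own CPU."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

rows = [
    {"id": 1, "last_name": "Smith", "zip": "02139"},
    {"id": 2, "last_name": "Smyth", "zip": "02139"},
    {"id": 3, "last_name": "Jones", "zip": "10001"},
]
print(sum(1 for _ in candidate_pairs(rows)))  # 1 candidate pair instead of 3
```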

But no matter how thorough the initial data quality effort is, it can't be a one-time exercise, according to Orr. The world keeps changing while the data in the database stays static, so the data grows steadily less accurate unless users keep feeding corrections back into it.

Lais is a Computerworld contributing writer in Takoma Park, Md.

Copyright © 2003 IDG Communications, Inc.
