Data Scrubbing

February 10, 2003 12:00 PM ET

Computerworld - The need to scrub data is made pretty clear by simple questions like this one: Are Jerry L. Jonson of 16 Clarke St., Altuna, PA, and Gerry L. Johnson of 16 Clark Street, Altoona, Penn., the same person? You would probably say that most likely they are. But a computer, without help from specialized software, would deal with the information as though it were about two different guys.

The human eye and mind recognize that the differences between the two records are probably the result of mistakes or inconsistencies in data entry. Weeding out and fixing or discarding inconsistent, incorrect or incomplete data is called data scrubbing or data cleansing.
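
As a rough illustration of the kind of help such software provides, the Python sketch below normalizes the two example records and scores how alike they are. It is not taken from any particular scrubbing product: the abbreviation table, the function names and the 0.85 threshold are all assumptions chosen for this example, and real tools use far larger reference tables and more careful matching rules.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table, just big enough for the example records.
ABBREVIATIONS = {"st": "street", "pa": "pennsylvania", "penn": "pennsylvania"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, expand known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def similarity(a, b):
    """Return a score from 0.0 to 1.0 for how alike two normalized records are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same(a, b, threshold=0.85):
    """Flag two records as probable duplicates (the threshold is an assumption)."""
    return similarity(a, b) >= threshold


record_a = "Jerry L. Jonson, 16 Clarke St., Altuna, PA"
record_b = "Gerry L. Johnson, 16 Clark Street, Altoona, Penn."

print(normalize(record_a))
print(normalize(record_b))
print("probable duplicate:", likely_same(record_a, record_b))
```

A score above the threshold only marks the pair as a candidate for merging. Whether to merge automatically or send the pair to a person for review is a separate policy decision.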

[Illustration: Data Scrubbing. Credit: Melinda Beck]

"Dirty data" has been a problem for as long as there have been computers -- or maybe for as long as people have attempted to gather and analyze information. It's a large part of the "garbage in" that can result in the worthless "garbage out" of a computing process.

The issue of data hygiene has become increasingly important as more and more corporations implement complex customer relationship management (CRM) systems and build data warehouses that merge information from many different sources.

Without data cleansing, the IT staffs of those companies face the unappetizing prospect of merging corrupt or incomplete bits of data from multiple databases. A single piece of dirty data might seem like a trivial problem, but if you multiply that "trivial" problem by thousands or millions of pieces of erroneous, duplicated or inconsistent data, it becomes a prescription for chaos.

Sources of Dirty Data

In its 2001 report about organizations implementing data warehouses for the purpose of business intelligence, Cutter Consortium identified the following causes of dirty data:

• Poor data entry, which includes misspellings, typos and transpositions, and variations in spelling or naming.

• Data missing from database fields.

• Lack of companywide or industrywide data coding standards (a big problem in health care, for example).

• Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database.

• Older systems that contain poorly documented or obsolete data.
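
Two of the causes listed above, missing fields and values that ignore a coding standard, are simple to check for mechanically. The Python sketch below shows one way such an audit might look. The field names, the list of valid state codes and the `audit` function are hypothetical, made up for this example rather than drawn from the Cutter Consortium report.

```python
# Hypothetical schema and coding standard, for illustration only.
REQUIRED_FIELDS = ("name", "street", "city", "state")
VALID_STATES = {"PA", "NY", "OH"}


def audit(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    # Cause: data missing from database fields.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")
    # Cause: values that do not follow the agreed coding standard.
    state = str(record.get("state") or "").strip()
    if state and state not in VALID_STATES:
        problems.append(f"non-standard state code: {state!r}")
    return problems


records = [
    {"name": "Gerry L. Johnson", "street": "16 Clark Street",
     "city": "Altoona", "state": "PA"},
    {"name": "Jerry L. Jonson", "street": "16 Clarke St.",
     "city": "", "state": "Penn."},
]

for record in records:
    print(record["name"], "->", audit(record) or "ok")
```

Checks like these catch only problems that can be stated as rules. Misspellings and naming variations generally need approximate matching against reference data instead.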


