
Data Scrubbing

February 10, 2003 12:00 PM ET

Computerworld - A simple question makes the need for data scrubbing clear: Are Jerry L. Jonson of 16 Clarke St., Altuna, PA, and Gerry L. Johnson of 16 Clark Street, Altoona, Penn., the same person? You would probably say they are. But a computer, without help from specialized software, would treat the two records as though they described two different people.

The human eye and mind recognize that the differences between the two sets of data records are probably the result of mistakes or inconsistencies in data entry. Weeding out and fixing or discarding inconsistent, incorrect or incomplete data is what's called data scrubbing or cleansing.
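
To make the Jerry/Gerry question concrete, here is a minimal sketch of how software might guess that two records describe the same person. It is only an illustration, not a description of any commercial scrubbing product: the field names, the helper functions `similarity` and `likely_same_person`, and the 0.75 threshold are all assumptions chosen for this example. It uses Python's standard `difflib` module to score how alike two strings are.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same_person(rec_a: dict, rec_b: dict, threshold: float = 0.75) -> bool:
    """Average the per-field similarity and compare it to a threshold.

    The threshold is an assumed value for illustration; a real system
    would tune it against records whose true matches are known.
    """
    fields = ("name", "street", "city")
    scores = [similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


# The two records from the example above.
record_1 = {"name": "Jerry L. Jonson", "street": "16 Clarke St.", "city": "Altuna"}
record_2 = {"name": "Gerry L. Johnson", "street": "16 Clark Street", "city": "Altoona"}

# A hypothetical, clearly unrelated record for contrast.
record_3 = {"name": "Maria Delgado", "street": "902 Pine Avenue", "city": "Tucson"}

print(likely_same_person(record_1, record_2))
print(likely_same_person(record_1, record_3))
```

A score like this only flags probable duplicates. Deciding which spelling is correct still takes reference data or a person.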

[Illustration: Data Scrubbing. Credit: Melinda Beck]

"Dirty data" has been a problem for as long as there have been computers -- or maybe for as long as people have attempted to gather and analyze information. It's a large part of the "garbage in" that can result in the worthless "garbage out" of a computing process.

The issue of data hygiene has become increasingly important as more and more corporations implement complex customer relationship management (CRM) systems and build data warehouses that merge information from many different sources.

Without data cleansing, the IT staffs of those companies face the unappetizing prospect of merging corrupt or incomplete bits of data from multiple databases. A single piece of dirty data might seem like a trivial problem, but if you multiply that "trivial" problem by thousands or millions of pieces of erroneous, duplicated or inconsistent data, it becomes a prescription for chaos.

Sources of Dirty Data

In its 2001 report about organizations implementing data warehouses for the purpose of business intelligence, Cutter Consortium identified the following causes of dirty data:

• Poor data entry, which includes misspellings, typos and transpositions, and variations in spelling or naming.

• Data missing from database fields.

• Lack of companywide or industrywide data coding standards (a big problem in health care, for example).

• Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database.

• Older systems that contain poorly documented or obsolete data.
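
Several of the causes listed above, such as naming variations, missing fields and the lack of a coding standard, can be caught by a simple rule-based pass. The sketch below is a hedged illustration only: the lookup tables, the required-field list and the `scrub` function are hypothetical and far smaller than anything a real cleansing tool would use.

```python
import re

# Illustrative lookup tables (assumptions, not a complete coding standard).
STREET_SUFFIXES = {"st": "Street", "st.": "Street", "street": "Street",
                   "ave": "Avenue", "ave.": "Avenue", "avenue": "Avenue"}
STATE_CODES = {"pa": "PA", "penn": "PA", "penn.": "PA", "pennsylvania": "PA"}
REQUIRED_FIELDS = ("name", "street", "city", "state")


def scrub(record: dict) -> tuple[dict, list[str]]:
    """Return a cleaned copy of the record and a list of problems found."""
    # Collapse runs of whitespace and treat None as an empty string.
    cleaned = {k: re.sub(r"\s+", " ", str(v or "")).strip()
               for k, v in record.items()}

    # Flag required fields that are absent or empty.
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if not cleaned.get(f)]

    # Rewrite the last word of the street to one agreed spelling.
    words = cleaned.get("street", "").split()
    if words and words[-1].lower() in STREET_SUFFIXES:
        words[-1] = STREET_SUFFIXES[words[-1].lower()]
        cleaned["street"] = " ".join(words)

    # Map state spellings to a single code, or flag unknown values.
    state = cleaned.get("state", "")
    if state:
        if state.lower() in STATE_CODES:
            cleaned["state"] = STATE_CODES[state.lower()]
        else:
            problems.append(f"unrecognized state: {state}")

    return cleaned, problems


cleaned, problems = scrub({"name": "Gerry L. Johnson", "street": "16 Clark  St.",
                           "city": "Altoona", "state": "Penn."})
print(cleaned, problems)
```

Rules like these handle known variants. Misspellings such as "Clarke" for "Clark" fall outside any lookup table and need fuzzy matching or a trusted reference list.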


