
February 10, 2003 (Computerworld) -- The need to scrub data is made clear by simple questions like this one: Are Jerry L. Jonson of 16 Clarke St., Altuna, PA, and Gerry L. Johnson of 16 Clark Street, Altoona, Penn., the same person? You would probably say that they most likely are. But a computer, without help from specialized software, would treat the information as though it described two different people.
The human eye and mind recognize that the differences between the two records are probably the result of mistakes or inconsistencies in data entry. Weeding out and then fixing or discarding inconsistent, incorrect or incomplete data is called data scrubbing or data cleansing.
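As a rough illustration of how software might reach the same judgment a person does, the sketch below compares the two example records field by field using Python's standard difflib module. The abbreviation table, the function names and the 0.85 threshold are assumptions made for this example; they are not taken from any particular cleansing product, which would rely on much larger reference tables and more careful matching rules.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real cleansing tool would use a far larger one.
ABBREVIATIONS = {"st": "street", "penn": "pa", "pennsylvania": "pa"}


def normalize(text):
    """Lowercase, drop punctuation, and expand known abbreviations."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def similarity(a, b):
    """Return a score from 0 to 1 for two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def probably_same(rec_a, rec_b, threshold=0.85):
    """Average the per-field similarity and compare it with an assumed threshold."""
    scores = [similarity(rec_a[field], rec_b[field]) for field in rec_a]
    return sum(scores) / len(scores) >= threshold


record_a = {"name": "Jerry L. Jonson", "street": "16 Clarke St.",
            "city": "Altuna", "state": "PA"}
record_b = {"name": "Gerry L. Johnson", "street": "16 Clark Street",
            "city": "Altoona", "state": "Penn."}

# The two example records are flagged as a likely match under these assumptions.
print(probably_same(record_a, record_b))
```

Averaging per-field scores is only one possible design. A production matcher would probably weight fields differently and send borderline pairs to a person for review rather than merging them automatically.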

Credit: Melinda Beck
The issue of data hygiene has become increasingly important as more and more corporations implement complex customer relationship management (CRM) systems and build data warehouses that merge information from many different sources.
Without data cleansing, the IT staffs of those companies face the unappetizing prospect of merging corrupt or incomplete bits of data from multiple databases. A single piece of dirty data might seem like a trivial problem, but if you multiply that "trivial" problem by thousands or millions of pieces of erroneous, duplicated or inconsistent data, it becomes a prescription for chaos.
Sources of Dirty Data
In its 2001 report on organizations implementing data warehouses for business intelligence, Cutter Consortium identified the following causes of dirty data:
Poor data entry, which includes misspellings, typos and transpositions, and variations in spelling or naming.
Data missing from database fields.
Lack of companywide or industrywide data coding standards (a big problem in health care, for example).
Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database.
Older systems that contain poorly documented or obsolete data.
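Two of the causes above, missing fields and the lack of coding standards, lend themselves to simple automated checks. The following is a minimal sketch, assuming a record is a dictionary of text fields; the required-field list, the state-code table and both function names are hypothetical and chosen only for illustration.

```python
# Hypothetical required fields and state-code table, chosen only for illustration.
REQUIRED_FIELDS = ("name", "street", "city", "state")
STATE_CODES = {"pa": "PA", "penn": "PA", "pennsylvania": "PA"}


def _state_key(record):
    """Reduce the state field to a lowercase lookup key without a trailing period."""
    return str(record.get("state") or "").strip().rstrip(".").lower()


def audit(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            problems.append(f"missing {field}")
    key = _state_key(record)
    if key and key not in STATE_CODES:
        problems.append(f"unrecognized state code: {record['state']}")
    return problems


def standardize_state(record):
    """Return a copy of the record with the state mapped to its standard code, if known."""
    cleaned = dict(record)
    cleaned["state"] = STATE_CODES.get(_state_key(record), record.get("state"))
    return cleaned


incomplete = {"name": "Gerry L. Johnson", "street": "", "city": "Altoona", "state": "Penn."}
print(audit(incomplete))
print(standardize_state(incomplete)["state"])
```

Unrecognized values are reported and left unchanged, on the assumption that a wrong automatic correction is worse than a flagged record awaiting review.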