Managing Data Madness

This article is excerpted from the book Business Intelligence for the Enterprise by Mike Biere. It is posted with permission from publisher Prentice Hall PTR, in alliance with IBM Press, copyright 2003.

Data extraction, transformation and cleansing

The ABC enterprise has its current data held in some VSAM files and other sources. Company officials would like to begin constructing a customer-centric warehouse that includes the last five years of history by customer. Much of the detailed data has been archived to tape because it would take enormous amounts of disk space and they seldom use it. Currently, they haven't the disk capacity to store it.

Their data has some "problems." Customer information in the original systems was stored as a series of encoded numbers and text. Many of their applications use a look-up table (another file) to match the customer numbers with the appropriate names and other relevant information. Much of their historical data has inaccuracies, including numerous customers that they no longer service.

The company has been discussing a merger with another company, which has a different set of customer data on different databases and platforms.

They installed a relational database to hold some of the more current customer information and they joined the customer data to the look-up tables to provide a view of the customer in a format that their users like today.

They could just load the large volumes of historical data to their RDB (relational database) and perform the same joins, but because they have data integrity issues, they would populate the bad with the good. They need to address the missing or long-defunct customers that have no bearing on the data that they currently use.

Business Intelligence for the Enterprise
They resolve discrepancies with "on-the-fly" fixes to the new RDB based on business knowledge from their Customer Relations and Sales departments. They have not integrated some of the newer, advanced data cleansing technologies, so they are performing asynchronous data scrubbing.

If they clean out the defunct customers, they have less information and smaller volumes of data. However, they also want to use buying patterns, product sales, and market analysis such that some of the defunct information might be of value to them. They face these issues in such a scenario:

  • The sheer size of the data
  • Multiple forms and formats of the same data
  • Different users of parts of the data, making a universal delete or adjustment difficult
  • New data constantly being added so they can't "stop the presses"
  • Many manual adjustments

If this sounds like your environment, then rest assured that you are not a pioneer in this space. Things that seem trivial to the non-technician can be immensely time-consuming and complex for those who support you. What is required to find a record on a tape? The tape itself has to be physically mounted. You just have to hope that you have the correct tape volume because there is no fast-path to identifying where the record actually resides. The records on the tape are written sequentially and must be processed the same way: Read the first record; no, that's not it; read the second one; no, that's not it either; read the 250,000th; no, that's not it! Now, perform an action. You go to the next request and perhaps find that the record you need has already been passed, so you have to rewind and start over. This is just basic data processing.

What if you cannot accurately match customer records from different sources for any number of reasons? You may have switched systems and stored the data in different ways. It is very common to see online systems in which the data has been stored with terse, encoded strings representing business information that you pull out programmatically. You might have stored your customer data as a string in which the state, county, district, blood type, whatever is represented in a format such as A993029987534111DEFG. In positions 6 and 7, you store the state key. The value 29 represents the state of California (or is it the state of confusion?). Positions 16 and 17 keep departmental information. In our hypothetical case, 1D is the Catalog Sales department.

This storage technique is often referred to as using "intelligent keys" to store lengthy information in a more condensed format. What if you no longer had Catalog Sales, or the state code for California was changed to 2765 because of a new system you brought in? What if you decided to lump all former catalog sales into a new, generic bucket?

On top of all this confusion, we add the normal processing errors from data entry or other input mechanisms. I may have MR. J SMITH, MR. JOHN SMITH, MR. J. N. SMITH, MR. JONH SMIHT . . . and they are all the same person. He may have moved three times in the past five years and have changed jobs twice. Maybe it's a woman who has opted to change her last name or has decided to hyphenate both her last names to retain some family identity.

How do you combine the following into a single, coherent household for use in a CRM solution?

Mr. J SMITH A993029987534111DEFG

Mr. JOHN SMITH A993012987534111DEFG

Mrs. LILIAN JONES-SMITH A993029987534111KLFG

Mr. JOHN SMIHT A993011987534112NEFG

The answer is . . . very carefully! There are significant anomalies at hand here. The long strings beginning with A99 all have differences within them reflecting changes in location, etc. There are different permutations of the names, as well as a misspelling and a hyphenation to boot! Where do we even begin?

It is obvious that we need to automate as much of this data investigation and transforming work as possible. There will always be anomalies and other data integrity issues. We will always face the possibility of needing to merge new or different data sources into our current architecture. Most pressing of all are emerging new business reasons that no one anticipated but that force us to produce different analyses or business information.

Rather than take a longhand approach to these issues, we may decide to look at ETL (extract, transform, load) tools. The purpose of such offerings is to automate the steps and to offer sophisticated functions that allow us to somehow come up with a "JONES" household based on the gaggle of records shown earlier.

ETL tools allow us to construct documented and replicable processes/steps to be used over and over. They also allow us to add new information sources or modify the steps so we might have a faster turn-around in processing changes. Along with all these processes, we have to deal with the metadata associated with the information before, during, and after the steps have been completed.

Metadata capture and utilization is critical to the orderly flow and understanding of what has been and what has now become new information. No matter what vendors and tools you examine today, you can't "plop in and use" any of them. It is simply too complex an environment because of the issues surrounding data that we have just described.

Extraction technologies abound, and many are quite good. If we look at our example of the state of California represented by two different values, many extraction and transformation tools provide a simple way to move data from one source to the other and change its value on the fly. We could take the values of 29 or 2765 and create a California state value for those that match.

When it comes to data storage and assigning values, there are numerous schools of thought. States may be a bad example, but it is one to which many can relate. Because we have shorthand notation for states that is universally accepted (e.g., CA for California), why not store the CA as a value in a state column? Change the data to match the usage whenever possible, and see if you can curtail dragging along old baggage such as cryptic and "user-hostile" values.

Data cleansing may be the toughest component of all and the one that takes the greatest amount of time. On occasion, there are situations in which the data is simply beyond hope. It may not be impossible to cleanse all data properly, but there are impossible situations in which you simply cannot reconcile the values. Let me provide an example.

I encountered one situation as a consultant in which a customer had used a large number of temporary hires to enter data. The business also involved providing a large number of part-time employees to other firms. The temps entering data had to put in a valid social security number for the part-timers being utilized. There was no check-and-balance system, so many temps used their own SSN as part of the input data. Reconciling billing later proved to be nearly impossible in some circumstances.

When I attempted to build a file of valid SSNs matched to the part-time hires, it was nearly impossible to match anything that made sense. Overall data entry for this application was so poorly implemented that field after field had information that had been omitted, wrongly keyed in, or was misspelled to the point of embarrassment. It was literally impossible to ensure integrity for any set of records. Users with some PC tools and little IT skill had originally built the system. There were no consistent values within fields that any data-matching software could grab onto.

The only intelligent approach I could offer was to propose a solution whereby, from the first day of usage of the new system, we could verify and validate accuracy that would be as close to 100 percent foolproof as possible. Retrofitting old information in an automated fashion simply was not within the realm of possibility. The client was not overjoyed at this, but they attempted several alternatives and verification processes only to reach the same conclusion.

Data must be handled as a valuable and critical resource, not just some bits and bytes that get written on disks to be used and tossed later. Customer information is extremely valuable and expensive to accumulate. Business processes change, and along with them the data required to provide meaningful analysis changes. You have to consider how data will be used in the future or how it might be used in the business.

I worked with one customer who had built a data warehouse that was partially implemented and partially loaded. The end users had been given access to the semi-complete solution and had their appetite whetted just enough to begin to play with the data in its immature form. To compensate for some of the shortfall, they had to extract data back out of the warehouse. This was used to match to another set of extracted records created by a 4GL with which they had some experience.

The partially completed warehouse became a deathtrap. The changes and embellishments that the IT staff planned were going to dramatically affect the form and format of the existing partial solution. The users were in open rebellion because they had invested considerable time and energy using this hodgepodge solution to compensate for the shortfall. It also turned out that the user's enrichments being done outside the warehouse were not planned for the new version. The parties were at an impasse, and no one was even remotely happy.

This article is excerpted from the book Business Intelligence for the Enterprise by Mike Biere. It is posted with permission from publisher Prentice Hall PTR, in alliance with IBM Press, copyright 2003.

Special Report

The Future of BI

Stories in this report:

Copyright © 2004 IDG Communications, Inc.

7 inconvenient truths about the hybrid work trend
Shop Tech Products at Amazon