ETL: Preparation is the best bet

Extract, transform and load software is the foundation of successful data warehouses.

When you've got half a dozen riverboat gambling operations, it's important that everyone plays by the same rules. Argosy Gaming Co., with headquarters in Alton, Ill., and a fleet of six riverboat casinos, had decided that bringing all customer data together would enhance management's view of operations and potentially help strengthen customer relationships. To accomplish those goals, though, the company needed to access disparate databases and put in place an extract, transform and load (ETL) system to help populate and maintain a central data warehouse.

Jason Fortenberry, a data-warehousing analyst, came aboard at Argosy just as the company's data warehouse project started in 2001. His job was made easier, he says, by the adoption of Toronto-based Hummingbird Ltd.'s Genio ETL tool, which helped bridge systems and automate processes. But like others going through ETL projects, he learned the hard way that preparing for ETL is just as important as having the right tools.

The riverboats each had unique and incompatible ways of defining a host of operational activities and customer characteristics -- in essence, they were each playing the same game but with different rules. But those problems remained hidden until reports from the company's Teradata data warehouse began to turn up inconsistent or troubling data. That's when Fortenberry and his staff discovered conflicting definitions for a wide range of data types -- problems he wishes he had identified much earlier.

Fortenberry's troubles -- and his successes -- are typical of ETL, the complex and often expensive prelude to data warehouse success. ETL is often problematic because of its inherent complexity and underlying business challenges, such as making sure you plan adequately and have quality data to process.

Analysts, users and even vendors say all bets are off if you don't have a clear understanding of your data and what you want to achieve with it. Then there are choices, like whether to go for a centralized architecture -- the simplest and most common configuration -- or a distributed system, with ETL processing spread across sort tools, system utilities and target databases, which is sometimes a necessity in larger, more complicated data warehouses. Even if you navigate those waters successfully, you still need to ensure that the ETL foundation you build for your data warehouse can scale to meet growing data streams and future information demands.

As the term implies, ETL involves extracting data from various sources, transforming it (usually the trickiest part) and loading it into the data warehouse. A transformation could be as simple as a reordering the fields of a record from a source system.

But, as Philip Russom, a Giga Information Group Inc. analyst explains, a data warehouse often contains data values and data structures that never existed in a source system. Since many analytic questions a business user would ask of a data warehouse can be answered only with calculated values (like averages, rankings or metrics), the ETL tool must calculate these from various data sources and load them into the warehouse.

Similarly, notes Russom, a data warehouse typically contains "time-series data." The average operational application keeps track of the current state of a value such as a bank account balance. It's the job of the ETL tool to regularly append new states of a value to the series.

ETL is generally not cheap. Russom estimates that for smaller projects, ETL tools range from "free" (such as the Data Transformation Services feature in Microsoft Corp.'s SQL Server) to about $35,000 for DT/Studio from Embarcadero Technologies Inc. in San Francisco. But for tools that can handle enterprise-scope numbers, diverse sources and targets, plus complex transformations, Russom says to expect entry-level pricing of $75,000 to $100,000. "By the time you've added necessities like connectors to specialized sources and targets or extra developer seats, the average is more like $200,000 per year," he says.

The initial costs of ETL tools will almost always be higher than the initial costs of hand-coding, which was the normal approach to ETL through the mid-1990s. "With a hand-coded solution, entry costs are very low," Russom says. However, over time, the economics of the two approaches invert; hence, it's best to look at total ETL costs over a five-year period.

The Hard Part

"We see ETL as perhaps the most critical component in the data warehouse because it includes data integration and cleansing and acquisition -- the hard parts of the project," says Jim Keene, enterprise data warehouse senior project manager at Milwaukee-based Harley-Davidson Inc. "Once it is in the warehouse and the schema is set, it is pretty easy to build views or apply an OLAP [online analytical processing] template," he adds.

Harley-Davidson asked Keene to create a comprehensive data warehouse from a tangle of loosely knit applications held together with an ETL patchwork of SQL and Cobol code.

"That had become harder and harder to use, given the flexibility we needed," says Keene. By comparison, ETL data integration tools like the one he selected, Data Stage from Westboro, Mass.-based Ascential Software Corp., "come out of the box with standard transformations, connectivity to many sources and a visual interface that makes developer productivity very high," he says.

Although he had to add three people to get the project going, Keene says, "we've seen a payback already in terms of labor savings -- now, I don't have to write a lot of SQL and Cobol." Based on his experience, he adds, it's important to pick a strong partner or vendor that has a top-tier selection of tools.

For his part, Gary Feierstein says he wishes his organization had learned the ETL tools lesson earlier. Feierstein is senior director of IT at Premier Healthcare Inc., a Charlotte, N.C.-based company that agglomerates information for scores of clients in the health care field.

Years ago, the company set up its first data warehouse using homegrown tools. Year by year, the process of adapting the tools to new ways of doing business and to the needs of new customers began to take more and more time and effort.

In the summer of 2001, costs were so high that Feierstein was assigned the job of re-engineering the ETL process to make it more scalable and cost-effective. On the advice of Chicago-based consulting firm Knightsbridge Solutions LLC, he selected the Co-Operating System from Ab Initio Software Corp. in Lexington, Mass.

"The key was that it was robust, capable of working across multiple platforms, and you can add hardware and keep processing," says Feierstein.

The ETL deployment went smoothly and stayed close to schedule, with a few "bumps and bruises," says Feierstein. "The key to any new ETL development is the learning curve," he explains. Critical issues include having a strong quality assurance environment and building in scalability.

"Ab Initio offers the ability to scale our product by adding hardware without rewriting the software code," says Feierstein. In addition, he is able to use a distributed approach to ETL. He can divide processing tasks and run different processes on different servers and then bring the solution set back together for the final output.

Asking the Right Questions

For his yearlong ETL project, which began with a clean slate, Argosy's Fortenberry says Hummingbird's Genio Suite, a data integration and ETL tool, quickly became the project's "central nervous system," coordinating the process for extracting source data and loading the warehouse.

He says Genio can handle almost any extraction task in databases it supports, such as Microsoft SQL Server.

But Fortenberry also had to work with a database from Progress Software Corp. in Lexington, Mass., which Genio doesn't support. To handle data from that database, Genio launches a Perl script that makes a Telnet connection to the Progress server and executes a Progress utility to extract the data. After this step, Genio, using its internal support for the file transfer protocol, moves the data from the Progress server into the ETL process.

Fortenberry says Genio handles all the transform work except address data -- for that, it launches verification software from Trillium Software in Billerica, Mass. And he says sometimes there are faster alternatives than running the data through the Genio engine, and Genio has the flexibility to support this. "We have a few Perl scripts for transformations like this that Genio launches and then picks up the output to continue along the ETL process," he says.

But for Argosy, getting all that data into the warehouse didn't produce immediate gratification. "The lesson was that people thought that they were talking about the same thing, but they actually were not," says Fortenberry.

For example, he explains, riverboats calculated visits differently. One property would credit a customer with a visit only if he actually played at a slot machine or table. Another had an expanded definition and credited customers with visits when they redeemed coupons, even if they didn't play. So identical customer activity might have one riverboat reporting four player visits and another reporting 10. "This type of discovery was repeated for everything from defining what a 'player' is to calculating a player's profitability," says Fortenberry.

IT played a lead role in identifying problems and helping to hammer out a consensus among the business units about how to define and use many categories of data, he says.

Now, with the data warehouse up and running smoothly, the number of problem-resolution meetings has dropped dramatically. Still, Fortenberry reckons that three quarters of the meetings he attends nowadays have a business focus. "For our part, we now know better what questions to ask going into the process," he says.

Earls is a freelance writer in Franklin, Mass.

Copyright © 2003 IDG Communications, Inc.

It’s time to break the ChatGPT habit
Shop Tech Products at Amazon