One of the more prosaic parts of data warehousing is, well, getting the data into the warehouse.
This has long been handled by vendors that are expert in the field of extract, transform and load (ETL). Even there, innovation focused more on the problem of transforming the data. Loading the data seemed a piece of cake by comparison.
That is, until business intelligence (BI) and analytics started becoming a round-the-clock affair. Also, today's biggest BI users -- banks, telecommunications providers, Web advertisers -- operate data warehouses larger than a petabyte in size and import huge swaths of data -- 50TB of data per day, as in the case of one of Teradata Inc.'s customers.
BI and ETL vendors are responding. The past several months have seen a number of start-ups and lesser-known firms touting screaming-fast data-loading speeds, both in the lab and in the field.
- Database start-up Greenplum Inc. said it has a customer routinely loading 2TB of data in half an hour, for an effective throughput of 4TB per hour.
- Rival database start-up Aster Data Systems Inc. claimed that its nCluster technology can enable customers to reach almost 4TB (specifically, 3.6TB) per hour.
- Data-integration vendor Syncsort Inc. said third-party-validated lab tests show its software can load 5.4TB of data into a Vertica Systems Inc. columnar data warehouse in under an hour.
- Not to be outdone, semantic data integration start-up Expressor Software Corp. claimed that in-house tests show its data-processing engine able to scale to nearly 11TB per hour.
"If they are really performing at this rate, it's quite significant and really impressive," said Jim Kobielus, an analyst at Forrester Research Inc., since "anything above a terabyte per hour is good."
Blazing past the incumbent BI and ETL vendors
What about the established firms? SAS Institute Inc. and Sun Microsystems Inc. two years ago demonstrated a SAS data warehouse running on Sun Microsystems hardware with StorageTek arrays that pushed through 1.7TB in 17 minutes, or the equivalent of nearly 6TB per hour.
But apart from SAS, other big-name vendors have posted data-integration performance benchmarks that fall well short of those touted by these upstarts.
- Three years ago, Informatica Corp. claimed its PowerCenter 8 software loaded data at a rate of 1.33TB per hour. The company, which declined to comment today, hasn't posted any updated performance benchmarks.
- Oracle Corp. and Hewlett-Packard Co. last fall released the BI-oriented HP Oracle Database Machine, which they said loads data at up to 1TB per hour.
- Microsoft Corp. claimed at the launch of SQL Server 2008 a year ago that its SQL Server Integration Services 2008 had loaded the equivalent of 2.36TB in an hour.
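Several of the hourly figures above are extrapolated from loads of a different size and duration. The short Python sketch below shows that conversion for the two claims where the article gives both a volume and a time. The function name `tb_per_hour` is made up for illustration, and the sketch assumes the load rate holds steady for a full hour, which none of the vendors' statements confirm.

```python
def tb_per_hour(terabytes: float, minutes: float) -> float:
    """Extrapolate a load of `terabytes` finished in `minutes` to TB per hour."""
    return terabytes * 60.0 / minutes


# Figures quoted in the article; the labels are for display only.
claims = {
    "Greenplum customer (2TB in 30 minutes)": tb_per_hour(2.0, 30),
    "SAS and Sun demo (1.7TB in 17 minutes)": tb_per_hour(1.7, 17),
}

for label, rate in claims.items():
    print(f"{label}: {rate:.1f}TB per hour")
```

By this arithmetic the Greenplum load works out to 4TB per hour and the SAS and Sun demonstration to about 6TB per hour, matching the figures quoted.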
So how do they do it?
Most of these faster data integrators rely on the same basic secret sauce: software that shreds the data to be loaded before it is delivered via fast networks to dozens or more data warehousing servers running in a massively parallel grid.
That's how Greenplum's "scatter-gather streaming" technology works. The company's customer, Fox Interactive Media Inc., operator of MySpace.com, can load 2TB of Web usage data in half an hour into its 200TB Greenplum data warehouse, according to Ben Werther, director of product management at Greenplum.
Achieving 4TB/hour load speeds requires 40 Greenplum servers in a shared-nothing grid, Werther said. Doubling the servers doubles the load rate.
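Taken at face value, Werther's claim describes linear scaling from a baseline of 40 servers and 4TB per hour. The sketch below works through what that would imply. The function names are hypothetical, and perfectly linear scaling is Werther's assertion, not something the article verifies.

```python
import math

# Baseline from Werther's statement: 40 servers deliver 4TB per hour.
BASE_SERVERS = 40
BASE_TB_PER_HOUR = 4.0


def projected_rate(servers: int) -> float:
    """Load rate in TB per hour, assuming perfectly linear scaling."""
    return BASE_TB_PER_HOUR * servers / BASE_SERVERS


def servers_needed(target_tb_per_hour: float) -> int:
    """Smallest server count that reaches the target under the same assumption."""
    return math.ceil(target_tb_per_hour * BASE_SERVERS / BASE_TB_PER_HOUR)


print(projected_rate(80))   # doubling the servers doubles the rate
print(servers_needed(6.0))  # servers implied for a 6TB-per-hour target
```

Under that assumption, each server contributes about 0.1TB per hour.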
"The bigger your system gets, the faster it gets," Werther said.
Expressor Software's data processing engine is similarly scalable, according to John Russell, chief scientist at the Burlington, Mass.-based company, and companies can "buy and add channels as their performance needs increase."
A longtime data warehouse architect for Fortune 100 companies, Russell said he co-founded Expressor partly "out of the frustration I felt when dealing with the performance limitations and bottlenecks of those high-end DI tools."
Expressor's engine takes "full advantage" of 64-bit and multicore CPUs and massively parallel systems, Russell said. And the code is lean -- "only about 12,000 lines of that actually get executed at runtime," he said -- which helps Expressor achieve a top speed of 11TB per hour.
Aster Data's nCluster technology is slightly different. It relies on a dedicated tier of parallelized loading servers to achieve speeds of nearly 4TB per hour, according to Aster CEO Mayank Bawa.
Bawa said that assigning certain database servers in the grid to only load data frees up other database servers from this CPU-intensive task, boosting overall performance.
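The pattern these grid-based loaders share, shredding the incoming data and handing each piece to a different server to load in parallel, can be sketched in a few lines of Python. This is only an illustration of the general idea, not any vendor's implementation. The names `shred`, `load_shard` and `parallel_load` are invented here, the round-robin split is an assumption, threads stand in for separate servers, and `load_shard` merely counts rows instead of writing to a database.

```python
from concurrent.futures import ThreadPoolExecutor


def shred(rows, num_servers):
    """Split rows into one shard per server, dealing them out round-robin."""
    shards = [[] for _ in range(num_servers)]
    for i, row in enumerate(rows):
        shards[i % num_servers].append(row)
    return shards


def load_shard(server_id, shard):
    """Hypothetical stand-in for one server ingesting its shard."""
    return server_id, len(shard)


def parallel_load(rows, num_servers):
    """Shred the rows, load every shard concurrently, return rows per server."""
    shards = shred(rows, num_servers)
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        results = pool.map(load_shard, range(num_servers), shards)
        return dict(results)


loaded = parallel_load(list(range(10)), num_servers=4)
print(loaded)
```

Because no shard depends on another, adding servers adds loading capacity, which is the property the vendors above say they exploit. The sketch does not model a dedicated loading tier of the kind Bawa describes.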
The black sheep is Syncsort. Unlike the start-ups, Syncsort is a 41-year-old Woodcliff, N.J., company that started out as a mainframe software vendor.
Syncsort has 2,000 customers, including 525 for its DMExpress data integration software that was introduced in 2004. It has made a living by trying not to displace large data integration vendors, but to coexist with them.
"We are mostly brought in to solve ETL performance issues customers are having with Informatica or [IBM's] DataStage," said Ganesh Iyer, senior product marketing manager for Syncsort. "We have never lost a customer proof of concept because of performance."
Syncsort is also different, Iyer said. Unlike other vendors that rely on huge, expensive server grids, Syncsort achieved its 5.4TB/hour benchmark last December using a $250,000 set of HP blade servers running a regular copy of its software, he claimed.
"We did no tweaks. It was a trial version of the software, the same one we send out to customers," he said. Customers typically pay about $40,000 for DMExpress, which includes five years' worth of maintenance.
Do customers really have a need for speed?
Russell said customers are all looking for ultrafast loading performance.
"Every financial firm we talk with says they want ... something close to 1TB per day," he said. "For clickstream data [from Web sites], those figures could be as high as 200 billion clicks, or nearly 24TB a day."
Curt Monash, an independent database analyst, disagrees. "I think most commercial data warehouses will provide most users with much more load speed than they actually need," Monash wrote in a blog last fall.
Even Teradata, which last fall introduced a 50-petabyte data warehousing appliance for ultralarge BI users, is skeptical.
"Extreme data load rates are typically irrelevant to most customer environments," said Randy Lea, vice president of products and services at Teradata. For one, customers can load data throughout the day rather than in narrow time windows, reducing the need for ultrafast batch loads, he argued.
Most data warehousing systems, including Teradata's, can be configured to load data at rates of multiple terabytes per hour, Lea said. The problem is that such systems are at risk of becoming unbalanced and performing badly in other areas.
Also, "the current crop of 'gee whiz' data loading boasts have little value because there are no benchmark standards," Lea said.
That issue is being addressed by the Transaction Processing Performance Council (TPC), which is in the process of designing a new ETL benchmark.
Syncsort and Teradata are both on the development committee, which, according to Syncsort's Iyer, will meet for the first time next month.