Data Cubes

When we try to extract information from a stack of data, we need tools to help us find what's relevant and what's important and to explore different scenarios. A report, whether printed on paper or viewed on-screen, is at best a two-dimensional representation of data, a table using columns and rows. That's sufficient when we have only two factors to consider, but in the real world we need more powerful tools.

More QuickStudies

Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.

But data cubes aren't restricted to just three dimensions. Most online analytical processing (OLAP) systems can build data cubes with many more dimensions—Microsoft SQL Server 2000 Analysis Services, for example, allows up to 64 dimensions. We can think of a 4-D data cube as consisting of a series of 3-D cubes, though visualizing such higher-dimensional entities in spatial or geometric terms can be a problem.

In practice, therefore, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.

Relational or Multidimensional?

Since data cubes are such a useful interpretation tool, most OLAP products are built around a structure in which the cube is modeled as a multidimensional array. These multidimensional OLAP, or MOLAP, products typically run faster than other approaches, primarily because it's possible to index directly into the data cube's structure to collect subsets of data.

However, for very large data sets with many dimensions, MOLAP solutions aren't always so effective. As the number of dimensions increases, the cube becomes sparser—that is, many cells representing specific attribute combinations are empty, containing no aggregated data. As with other types of sparse databases, this tends to increase storage requirements, sometimes to unacceptable levels. Compression techniques can help, but using them tends to destroy MOLAP's natural indexing.

Data cubes can be built in other ways. Relational OLAP uses the relational database model. The ROLAP data cube is implemented as a collection of relational tables (up to twice as many as the number of dimensions) instead of as a multidimensional array. Each of these tables, called a cuboid, represents a particular view.

Because the cuboids are conventional database tables, we can process and query them using traditional RDBMS techniques, such as indexes and joins. This format is likely to be efficient for large data collections, since the tables must include only data cube cells that actually contain data.

However, ROLAP cubes lack the built-in indexing of a MOLAP implementation. Instead, each record in a given table must contain all attribute values in addition to any aggregated or summary values. This extra overhead may offset some of the space savings, and the absence of an implicit index means that we must provide one explicitly.

Other Considerations

From a structural perspective, data cubes are made up of two elements: dimensions and measures. I've already explained dimensions; measures are simply the actual data values.

It's important to keep in mind that the data in a data cube has already been processed and aggregated into cube form. Thus we normally don't perform calculations within a data cube. This also means that we're not looking at real-time, dynamic data in a data cube.

The data contained within a cube has already been summarized to show figures such as unit sales, store sales, regional sales, net sale profits and average time for order fulfillment. With this data, an analyst can efficiently analyze any or all of those figures for any or all products, customers, sales agents and more. Thus data cubes can be extremely helpful in establishing trends and analyzing performance. In contrast, tables are best suited to reporting standardized operational scenarios.

Building a Data Cube
This example uses sales figures from XYZ Co., which makes many kinds of widgets. For each sales transaction, we know four pieces of data:

red_bullet.gif
Which types of widget were involved (style, color, size and so on)
red_bullet.gif
Store or sales agent
red_bullet.gif
Sales amount
red_bullet.gif
Geographic region or territory

In a real-world situation, we would also know many other data items, including:

red_bullet.gif
Quantity

red_bullet.gif
Customer
red_bullet.gif
Cost to XYZ for each widget
red_bullet.gif
Order date
red_bullet.gif
Shipment date
red_bullet.gif
Method and cost of shipping

Any of these pieces of data can function as a dimension in a data cube. We can take any two dimensions and produce a 2-D table 1. Thus we can correlate or track sales against individual stores or sales agents. Add in a third factor, such as price, and we can produce a 3-D data cube 2 that allows us to see how much each store or sales agent is selling in addition to which type of widget. Swap in geography 3, and we can now see who is selling where.

Building a Data Cube

Kay is a Computerworld contributing writer in Worcester, Mass. You can reach him at russkay@charter.net.

Are there technologies or issues you'd like to learn about in QuickStudy? Send your ideas to quickstudy@computerworld.com.

To find a complete archive of our QuickStudies, go online to computerworld.com/quickstudies.

From CIO: 8 Free Online Courses to Grow Your Tech Skills
Join the discussion
Be the first to comment on this article. Our Commenting Policies