Cleaning Out the Attic

Data classification tools offer policy-based management of data, freeing up primary storage.

Matt Decker, a contracted IT manager at the National Nuclear Security Administration, knew he couldn't continually add expensive high-end storage arrays to keep up with the agency's 40% annual data growth rate. And manually emptying recycle bins and deleting temp files wasn't freeing up enough space.

"When data keeps growing, you suddenly become a slave to it," says Decker, who works for Honeywell Federal Manufacturing & Technologies LLC, which operates the plant.

Decker wanted to see what type of data was filling up his high-end disk arrays so he could assess its value and determine where and how to move it to cheaper storage media, either online or offline.

Enter Mountain View, Calif.-based Arkivio Inc., which Decker hired two years ago to perform a data audit. He was shocked at what Arkivio found: The majority of the stored data was duplicate files, temporary files and e-mail attachments—3.5TB of it. "If someone sent an e-mail to me with an attachment I thought was neat, I'd save it, and so would everyone else who got it," says Decker.


Simply Put

DATA CLASSIFICATION TOOLS automatically tag data prior to backup and use a policy engine to determine how to store it based on its importance to the business. But most of these tools address only unstructured data, like that created by e-mail and file-serving applications, not database records.


Now, using Arkivio's Auto-xplor tool, Decker can automatically tag that data before it's backed up and set a policy engine to determine how to store it based on its importance.

"The software and hardware was expensive. But the way I see it... at our growth rate, it was going to be that much more expensive later," he says. "I'm looking at a material cost avoidance of close to $1 million in six years."

This sort of data classification, or tagging, used to be manual. But many start-up vendors are now selling tools that place agents on application servers to search volumes. The classification software then creates reports on those volumes and places that information in a database that can be searched.

For example, data classification software records fields such as "date created" and "date last accessed" and can perform searches based on keywords. Administrators can then create policies that determine where data should be stored once it's classified.

Companies such as Arkivio, Njini Inc. in London, Kazeon Systems Inc. in Mountain View, Calif., and StoredIQ Corp. in Austin have been early to market with software that can classify and store data across multiple applications, such as e-mail and file servers.

Carolyn DiCenzo, an analyst at Gartner Inc., says e-mail is the No. 1 offender for eating up space on primary storage arrays. Text files are No. 2. And this data can be risky to hold on to: When stored longer than necessary, e-mails can be difficult to wade through during legal discovery and can expose a company to litigation.

To date, data classification vendors have almost exclusively offered products for handling unstructured data, such as e-mail and text files. Structured data in databases doesn't need to be categorized, but there's a growing need to index that data so it, too, can be searched. The only company currently addressing structured data indexing is CopperEye Ltd. in Wiltshire, England, with its Greenwich software, says Steve Duplessie, an analyst at Enterprise Strategy Group Inc. in Milford, Mass.

Compliance-driven Effort

CDW Corp., a $5.7 billion technology reseller in Vernon Hills, Ill., expects to spend more than $1 million on the hardware and software needed to implement a data classification and tiered storage architecture. The goal is to better manage up to 250TB, much of which is on primary storage.

"For Fortune 500 companies, compliance issues have been a big deal for us this past year. All that turned our attention to records management and [information life-cycle management]," says K.C. Tomsheck, senior director of IT operations at CDW.

Tomsheck began implementing the data classification project in June. In the first phase, his legal department set policy definitions for how to treat different types of data. The project management office classified the data in the second phase, and in the final phase, the network engineering group will identify the technology to support a tiered storage architecture.

Tomsheck says the company's primary and backup data centers are both centrally located in Chicago, which helps tremendously in his data classification effort. "Databases, e-mail, file-shared documents, including unstructured data—it all resides on storage across two locations. That helps that we have data in one primary point and can evaluate it from there," he says.

The company purchased 12 EMC Corp. network-attached storage (NAS) arrays, including the Centera content-addressed storage array. If all goes as planned, about 150TB of data will be removed from primary storage arrays and placed onto the secondary NAS arrays. "We look at it as a 'pay me now or pay me later' proposition," says Tomsheck, who's hoping for a return on his investment in three to four years.

Duplessie notes that the cost of data classification isn't usually in the technology itself, but rather in the time spent determining how to categorize and classify the data.

As part of his strategy, Decker purchased an EMC Centera content-addressed storage array in order to archive e-mail and files online so end users can still access the data. 

Copyright © 2005 IDG Communications, Inc.
