Cleaning Out the Attic
Data classification tools offer policy-based management of data, freeing up primary storage.
Computerworld - Matt Decker, a contracted IT manager at the National Nuclear Security Administration, knew he couldn't continually add expensive high-end storage arrays to keep up with the agency's 40% annual data growth rate. And manually deleting recycle bins and temp files wasn't freeing up enough space.
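To put the 40% figure in perspective, the short sketch below compounds that growth rate over time. Only the rate comes from the article; the function names and the 1TB starting point are illustrative assumptions, and real growth is rarely this smooth.

```python
import math


def projected_capacity(start_tb, annual_growth, years):
    """Capacity after `years` of compound growth (annual_growth=0.40 means 40%)."""
    return start_tb * (1 + annual_growth) ** years


def doubling_time_years(annual_growth):
    """Years needed for stored data to double at a constant growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Hypothetical starting point of 1TB, so the result reads as a multiplier.
    for year in range(1, 7):
        print(year, round(projected_capacity(1.0, 0.40, year), 2))
    print("doubling time:", round(doubling_time_years(0.40), 2), "years")
```

At a steady 40% a year, stored data doubles roughly every two years and grows about 7.5-fold over six years.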
"When data keeps growing, you suddenly become a slave to it," says Decker, who works for Honeywell Federal Manufacturing & Technologies LLC, which operated the plant.
Decker wanted to see the type of data that was filling up his high-end disk, so he could rate the value of it and determine where and how he should move it to cheaper storage media, either online or off-line.
Enter Mountain View, Calif.-based Arkivio Inc., which Decker hired two years ago to perform a data audit. He was shocked at what Arkivio found: The majority of the stored data was duplicate files, temporary files and e-mail attachments—3.5TB of it. "If someone sent an e-mail to me with an attachment I thought was neat, I'd save it, and so would everyone else who got it," says Decker.
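The article doesn't describe how Arkivio's audit identified duplicates. One common approach, sketched here as an assumption rather than the vendor's method, is to hash each file's contents and group files that share a digest. The function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group regular files under `root` by content; keep groups of two or more."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so the same file is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


def reclaimable_bytes(groups):
    """Bytes freed if only one copy from each duplicate group were kept."""
    return sum(os.path.getsize(group[0]) * (len(group) - 1) for group in groups)
```

Running `find_duplicates` on a file share where many users saved the same e-mail attachment would return one group per attachment, and `reclaimable_bytes` estimates the space those extra copies occupy.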
Now, using Arkivio's Auto-xplor tool, Decker can automatically tag that data before it's backed up and set a policy engine to determine how to store it based on its importance.
"The software and hardware was expensive. But the way I see it... at our growth rate, it was going to be that much more expensive later," he says. "I'm looking at a material cost avoidance of close to $1 million in six years."
This sort of data classification, or tagging, used to be manual. But many start-up vendors are now selling tools that place agents on application servers to search volumes. The classification software then creates reports on those volumes and places that information in a database that can be searched.
For example, data classification software records metadata fields such as "date created" and "date last accessed" for each file and can search by keyword. Administrators can then create policies that determine where data should be stored once it's classified.
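A policy of this kind can be pictured as a small set of ordered rules applied to each file's metadata. The sketch below is illustrative only: the tier names, the 90-day and 365-day thresholds, and the "legal-hold" keyword are assumptions, not settings from any product named here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Metadata a classification tool might keep for one file."""
    path: str
    created: datetime
    last_accessed: datetime
    keywords: frozenset = frozenset()


def choose_tier(record, now):
    """Apply hypothetical rules in order; the first match wins."""
    if "legal-hold" in record.keywords:
        return "archive"            # keyword rule overrides age rules
    age = now - record.last_accessed
    if age <= timedelta(days=90):
        return "primary"            # recently used data stays on fast disk
    if age <= timedelta(days=365):
        return "secondary"          # cooler data moves to cheaper storage
    return "archive"                # untouched for over a year


def plan_moves(records, now):
    """Return a mapping of tier name to the file paths assigned to it."""
    plan = {}
    for record in records:
        plan.setdefault(choose_tier(record, now), []).append(record.path)
    return plan
```

Keeping the rules in one function means the legal department can define the policy while storage staff decide separately which hardware backs each tier.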
Companies such as Arkivio, Njini Inc. in London, Kazeon Systems Inc. in Mountain View, Calif., and StoredIQ Corp. in Austin have been early to market with software that can classify and store data across multiple applications, such as e-mail and file servers.
Carolyn DiCenzo, an analyst at Gartner Inc., says e-mail is the No. 1 offender for eating up space on primary storage arrays. Text files are No. 2. And this data can be risky to hold on to: When stored longer than necessary, e-mails can be difficult to wade through for legal discovery purposes and can expose a company to litigation.
To date, data classification vendors have almost exclusively offered products for handling unstructured data, such as e-mail and text files. Structured data in databases doesn't need to be categorized, but there's a growing need to index that data so it, too, can be searched. The only company currently addressing structured data indexing is CopperEye Ltd. in Wiltshire, England, with its Greenwich software, says Steve Duplessie, an analyst at Enterprise Strategy Group Inc. in Milford, Mass.
Compliance-driven Effort
CDW Corp., a $5.7 billion technology reseller in Vernon Hills, Ill., expects to spend more than $1 million on the hardware and software needed to implement a data classification and tiered storage architecture. The goal is to better manage up to 250TB, much of which is on primary storage.
"For Fortune 500 companies, compliance issues have been a big deal for us this past year. All that turned our attention to records management and [information life-cycle management]," says K.C. Tomsheck, senior director of IT operations at CDW.
Tomsheck began implementing the data classification project in June. In the first phase, his legal department set policy definitions for how to treat different types of data. The project management office classified the data in the second phase, and in the final phase, the network engineering group will identify the technology to support a tiered storage architecture.
Tomsheck says the company's primary and backup data centers are both centrally located in Chicago, which helps tremendously in his data classification effort. "Databases, e-mail, file-shared documents, including unstructured data—it all resides on storage across two locations. That helps that we have data in one primary point and can evaluate it from there," he says.
The company purchased 12 EMC Corp. network-attached storage (NAS) arrays, including the Centera content-addressed storage array. If all goes as planned, about 150TB of data will be removed from primary storage arrays and placed onto the secondary NAS arrays. "We look at it as a 'pay me now or pay me later' proposition," says Tomsheck, who's hoping for a return on his investment in three to four years.
Duplessie notes that the cost of data classification isn't usually in the technology itself, but rather in the time spent determining how to categorize and classify the data.
As part of his strategy, Decker purchased an EMC Centera content-addressed storage array in order to archive e-mail and files online so end users can still access the data.