
Cleaning Out the Attic

Data classification tools offer policy-based management of data, freeing up primary storage.

By Lucas Mearian
October 17, 2005 12:00 PM ET

Computerworld - Matt Decker, a contracted IT manager at the National Nuclear Security Administration, knew he couldn't continually add expensive high-end storage arrays to keep up with the agency's 40% annual data growth rate. And manually deleting recycle bins and temp files wasn't freeing up enough space.


"When data keeps growing, you suddenly become a slave to it," says Decker, who works for Honeywell Federal Manufacturing & Technologies LLC, which operates the plant.


Decker wanted to see the type of data that was filling up his high-end disk, so he could rate the value of it and determine where and how he should move it to cheaper storage media, either online or off-line.


Enter Mountain View, Calif.-based Arkivio Inc., which Decker hired two years ago to perform a data audit. He was shocked at what Arkivio found: The majority of the stored data was duplicate files, temporary files and e-mail attachments—3.5TB of it. "If someone sent an e-mail to me with an attachment I thought was neat, I'd save it, and so would everyone else who got it," says Decker.
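The article does not say how Arkivio's audit identified duplicate files. As a rough illustration of the general idea only, the Python sketch below groups files by a hash of their contents and estimates how much space the extra copies occupy. The function names (file_digest, find_duplicates, reclaimable_bytes) are hypothetical and are not taken from any vendor's product.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group files under root by content hash; keep groups with 2+ files."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def reclaimable_bytes(groups):
    """Bytes freed if only one copy of each duplicated file were kept."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in groups.values()
    )
```

Hashing file contents rather than comparing names catches the case Decker describes, where many users save the same attachment under their own folders.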


Simply Put


DATA CLASSIFICATION TOOLS automatically tag data prior to backup and use a policy engine to determine how to store it based on its importance to the business. But most of these tools address only unstructured data, like that created by e-mail and file-serving applications, not database records.


Now, using Arkivio's Auto-xplor tool, Decker can automatically tag that data before it's backed up and set a policy engine to determine how to store it based on its importance.


"The software and hardware was expensive. But the way I see it... at our growth rate, it was going to be that much more expensive later," he says. "I'm looking at a material cost avoidance of close to $1 million in six years."


This sort of data classification, or tagging, used to be manual. But many start-up vendors are now selling tools that place agents on application servers to search volumes. The classification software then creates reports on those volumes and places that information in a database that can be searched.
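To make the scan-and-catalog step concrete, here is a minimal Python sketch that walks a volume, records basic file metadata in a searchable SQLite database, and reports space used per file type. It is an assumption-laden illustration, not a description of how any vendor's agent works; the table layout and the names build_catalog and usage_by_extension are invented for this example.

```python
import os
import sqlite3


def build_catalog(root, db_path=":memory:"):
    """Walk root and store one metadata row per file in a SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, extension TEXT, size_bytes INTEGER, "
        "modified REAL, accessed REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            ext = os.path.splitext(name)[1].lower()
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
                (path, ext, st.st_size, st.st_mtime, st.st_atime),
            )
    conn.commit()
    return conn


def usage_by_extension(conn):
    """Report file count and total bytes per extension, largest first."""
    return conn.execute(
        "SELECT extension, COUNT(*), SUM(size_bytes) FROM files "
        "GROUP BY extension ORDER BY SUM(size_bytes) DESC"
    ).fetchall()
```

A report like this is the kind of view an administrator would need to see what type of data is filling a volume before writing any storage policy.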


For example, data classification software has fields such as "date created" and "date last accessed" and performs searches based on keywords. Administrators can then create policies that will determine where data should be stored once it's classified.
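A policy of this kind can be pictured as a small function that maps a file's metadata to a storage tier. The Python sketch below is purely illustrative: the tier names, the 90-day and 365-day thresholds, and the keyword list are assumptions made for the example, not values from the article or from any product.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FileRecord:
    path: str
    created: date          # "date created" field (not used by this sketch policy)
    last_accessed: date    # "date last accessed" field
    text: str = ""         # extracted text, used for keyword matching


def choose_tier(record, today, legal_keywords=("contract", "audit")):
    """Return a hypothetical storage tier for one classified file."""
    lowered = record.text.lower()
    # Keyword hits take priority: retain these in the archive tier.
    if any(keyword in lowered for keyword in legal_keywords):
        return "archive"
    age = today - record.last_accessed
    if age <= timedelta(days=90):
        return "primary"      # recently used: keep on high-end disk
    if age <= timedelta(days=365):
        return "secondary"    # cooling off: move to cheaper online storage
    return "offline"          # untouched for over a year
```

For instance, a file last opened seven months ago with no keyword matches would land on the "secondary" tier under these assumed thresholds.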


Companies such as Arkivio, Njini Inc. in London, Kazeon Systems Inc. in Mountain View, Calif., and StoredIQ Corp. in Austin have been early to market with software that can classify and store data across multiple applications, such as e-mail and file servers.


Carolyn DiCenzo, an analyst at Gartner Inc., says e-mail is the No. 1 offender for eating up space on primary storage arrays. Text files are No. 2. And this data can be risky to hold on to: When stored longer than necessary, e-mails can be difficult to wade through for legal discovery purposes and can expose a company to litigation.


To date, data classification vendors have almost exclusively offered products for handling unstructured data, such as e-mail and text files. Structured data in databases doesn't need to be categorized, but there's a growing need to index that data so it, too, can be searched. The only company currently addressing structured data indexing is CopperEye Ltd. in Wiltshire, England, with its Greenwich software, says Steve Duplessie, an analyst at Enterprise Strategy Group Inc. in Milford, Mass.


Compliance-driven Effort


CDW Corp., a $5.7 billion technology reseller in Vernon Hills, Ill., expects to spend more than $1 million on the hardware and software needed to implement a data classification and tiered storage architecture. The goal is to better manage up to 250TB, much of which is on primary storage.


"For Fortune 500 companies, compliance issues have been a big deal for us this past year. All that turned our attention to records management and [information life-cycle management]," says K.C. Tomsheck, senior director of IT operations at CDW.


Tomsheck began implementing the data classification project in June. In the first phase, his legal department set policy definitions for how to treat different types of data. The project management office classified the data in the second phase, and in the final phase, the network engineering group will identify the technology to support a tiered storage architecture.


Tomsheck says the company's primary and backup data centers are both centrally located in Chicago, which helps tremendously in his data classification effort. "Databases, e-mail, file-shared documents, including unstructured data—it all resides on storage across two locations. That helps that we have data in one primary point and can evaluate it from there," he says.


The company purchased 12 EMC Corp. network-attached storage (NAS) arrays, including the Centera content-addressed storage array. If all goes as planned, about 150TB of data will be removed from primary storage arrays and placed onto the secondary NAS arrays. "We look at it as a 'pay me now or pay me later' proposition," says Tomsheck, who's hoping for a return on his investment in three to four years.


Duplessie notes that the cost of data classification isn't usually in the technology itself, but rather in the time spent determining how to categorize and classify the data.


As part of his strategy, Decker purchased an EMC Centera content-addressed storage array to archive e-mail and files online so end users can still access the data.

