Data deduplication: Reducing storage bloat

Data storage needs continue to grow unabated, straining backup and disaster recovery systems while requiring more online spindles, using more power, and generating more heat. No one expects a respite from this explosion in data growth. That leaves IT professionals to search for technology solutions that can at least lighten the load.

One solution particularly well-suited to backup and disaster recovery is data deduplication, which takes advantage of the enormous amount of redundancy in business data. Eliminating duplicate data can shrink the storage space required by ratios ranging from 10:1 to 50:1 and beyond, depending on the technology used and the level of redundancy. With a little help from data deduplication, admins can reduce costs, lighten backup requirements, and accelerate data restoration in the event of an emergency.
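To put those ratios in perspective, here is a small arithmetic sketch in Python. The function name space_saved_percent is made up for illustration; the only assumption is that an N:1 ratio means N units of original data occupy one unit of storage after deduplication.

```python
def space_saved_percent(ratio: float) -> float:
    """Percentage of storage saved for a deduplication ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    # An N:1 ratio leaves 1/N of the original footprint on disk.
    return (1 - 1 / ratio) * 100


for ratio in (10, 50):
    print(f"{ratio}:1 -> {space_saved_percent(ratio):.0f}% less storage")
```

Note the diminishing returns: going from 10:1 to 50:1 moves the savings from 90 percent to 98 percent.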


Deduplication takes several different forms, each with its own approach and optimal role in backup and disaster recovery scenarios. Ultimately, few doubt that data deduplication technology will extend beyond the backup tier and apply its benefits across business storage systems. But first, let's take a look at why data deduplication has become so attractive to so many organizations.

Too much data, too little time

Duplicated data is strewn all over the enterprise. Files are saved to a file share in the data center, with other copies located on an FTP server facing the Internet, and yet another copy (or two) located in users' personal folders. Sometimes copies are made as a backup version prior to exporting to another system or updating to new software. Are users good about deleting these extra copies? Not so much.

A classic example of duplicate data is the email blast. It goes like this: Someone in human resources wants to send out the new Internet acceptable use policy PDF to 100 users on the network. So he or she creates an email, addresses it to a mailing list, attaches the PDF, and presses Send. The mail server now has 100 copies of the same attachment in its storage system. Only one copy of the attachment is really necessary, yet with no deduplication system in place, all the copies sit in the mail store taking up space.

Server virtualization is another area rife with duplicate data. The whole idea of virtualization is to "do more with less" and maximize hardware utilization by spinning up multiple virtual machines in one physical server. This equates to less hardware expense, lower utility costs, and (hopefully) easier management.

Each virtualized server is contained in a file. For instance, VMware uses a single VMDK (virtual machine disk) file as the virtual hard disk for the virtual machine. As you would expect, VMDK files tend to be rather large -- at least 2GB in size, and usually much larger.

One of the great features of virtual machines is that admins can stop the VM, copy the VMDK file, and back it up. Simply restart the machine and you're back online. Now what happens with all of these backup copies? That's right -- a lot of duplicated files stored on a file server. Admins keep "golden images" of working virtual servers to spawn new virtual machines -- not to mention the backup copies. Virtualization is a fantastic way to get the most out of CPU and memory, but without deduplication, virtual hard disks can actually increase network storage requirements.

Straining backup systems

How do you back up all this data? Old tape backup systems are too slow and lack the needed capacity. New high-end tape systems have the performance and capacity but are quite expensive. And no matter how good your tape drive is, Murphy's Law has a tendency to jump all over tape when it comes to restoration.

VTLs (virtual tape libraries) provide a modern alternative to tape, using hard disks in configurations that mimic standard tape drives. But at what cost? Additional spindles equal additional cost and additional power consumption. VTLs are fast and provide a reliable backup and restore destination, but if there were less data to back up, you'd have lower hardware and operating costs to begin with.

Data glut compounds the difficulty of disaster recovery, making each stage of near line and offline storage more expensive. Keeping a copy of the backup in near line storage makes restoration of missing or corrupt files easy. But depending on the backup set size and the number of backup sets admins want to keep handy, your near line storage can be quite substantial. The next tier, offline storage, is composed of tapes or other media copies that get thrown in a vault or sent to some other secure location. Again, if the data set is large and growing, this offline media set must expand to fit.

Many disaster recovery plans include sending the backup set to another geographical location over a WAN. Unless your company has deep pockets and can afford a very fast WAN link, it would be beneficial to keep the size of the backup set to a minimum. That goes double for restoring data. If the set is really large, trying to restore from an off-site backup will add downtime and frustration.

Defining data deduplication and its benefits

Simply put, deduplication is the process of detecting and removing duplicate data from a storage medium or file system. Detection of duplicate data may be performed at the file, bit, or block level, depending on the type and aggressiveness of the deduplication process.

The first time a deduplication system sees a file or a chunk of file, that data element is identified. Thereafter, each subsequent identical item is removed from the system but marked with a small placeholder. The placeholder points back to the first instance of the data chunk so that the deduped data can be reassembled when needed.
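The placeholder mechanism can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the DedupStore class, the fixed 4KB chunk size, and the use of a SHA-256 digest as the placeholder are all assumptions made for the example, and two chunks with the same digest are simply treated as identical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class DedupStore:
    """Toy block-level deduplicating store (hypothetical, for illustration)."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes (the first instance seen)
        self.files = {}   # file name -> ordered list of digests (placeholders)

    def put(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # Keep the chunk only the first time this digest is seen.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        # Reassemble the file by following each placeholder to its chunk.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
base = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
store.put("report-v1", base)
store.put("report-v2", base + b"C" * CHUNK_SIZE)  # shares two chunks with v1

assert store.get("report-v1") == base
print(store.stored_bytes())  # three unique chunks are kept, not five
```

Both files remain fully readable, yet the two chunks they share are stored only once.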

This deduplication process reduces the amount of storage space needed to represent all of the indexed files in the system. For example, a file system that has 100 copies of the same document from HR in each employee's personal folder can be reduced to a single copy of the original file plus 99 tiny placeholders that point back to the original file. It's easy to see how that can vastly reduce storage requirements -- as well as why it makes much more sense to back up the deduped file system instead of the original file system.
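The HR document scenario can be mimicked at the file level with a short Python sketch. Everything here is hypothetical: the dedupe_files function, the folder paths, and the stand-in document contents exist only for the example, and whole-file SHA-256 digests are assumed to be a good enough test of whether two files are identical.

```python
import hashlib


def dedupe_files(files):
    """Split a mapping of path -> bytes into kept originals and placeholders."""
    first_seen = {}    # content digest -> path of the first copy
    originals = {}     # path -> bytes actually kept on disk
    placeholders = {}  # path -> path of the original it points back to
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            placeholders[path] = first_seen[digest]
        else:
            first_seen[digest] = path
            originals[path] = data
    return originals, placeholders


policy = b"Acceptable use policy " * 1000  # stand-in for the HR document
files = {f"/home/user{n:03d}/policy.pdf": policy for n in range(100)}

originals, placeholders = dedupe_files(files)
print(len(originals), len(placeholders))  # 1 99
```

A backup job that walks the deduped view copies the document's bytes once and records 99 pointers, rather than copying the same bytes 100 times.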

Another benefit of data deduplication is the ability to keep more backup sets on near line storage. With the amount of backup disk space reduced, more "point in time" backups can be kept ready on disk for faster and easier file restoration. This also allows you to maintain a longer backup history. Instead of having three versions of the file to restore, users can have many more, enabling a very granular approach to file backups and accommodating loads of backup history.

Disaster recovery is another process that greatly benefits from data deduplication. For years, data compression was the only way to reduce the overall size of the off-site data set. Add in deduplication and the backup set can be reduced even more. Why transfer the same data set each night when only a small portion of it changed that day? Deduplication in disaster recovery makes perfect sense: Not only is the transfer time reduced, but the WAN is used more efficiently with less overall traffic.
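A rough Python sketch shows why the nightly transfer shrinks. It assumes fixed 4KB chunks, SHA-256 digests, and a sender that knows which digests the remote site already holds; the chunks_to_send function and the sample data are invented for illustration and do not describe any vendor's replication protocol.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def chunks_to_send(data, remote_digests):
    """Return {digest: chunk} for chunks the remote site does not yet hold."""
    to_send = {}
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in remote_digests:
            to_send[digest] = chunk
    return to_send


# Night one: ten distinct chunks, all of them new to the remote site.
monday = b"".join(bytes([n]) * CHUNK_SIZE for n in range(10))
remote = set(chunks_to_send(monday, set()))

# Night two: the same data set with a single chunk overwritten.
tuesday = monday[:3 * CHUNK_SIZE] + b"\xff" * CHUNK_SIZE + monday[4 * CHUNK_SIZE:]
delta = chunks_to_send(tuesday, remote)

print(len(remote), len(delta))  # 10 1
```

Only the changed chunk crosses the WAN on the second night. Fixed-size chunking is the simplest case to show; an insertion that shifts every later byte would change every later chunk under this scheme, so treat the sketch as a best-case illustration.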

Read more about how to manage data deduplication in InfoWorld's free PDF report, "Data Deduplication Deep Dive," including:

  • How data deduplication really works
  • File-, bit-, and block-level deduplication compared
  • Source, target, and inline deduplication compared
  • Deduping beyond the backup tier

This article, "Data deduplication: Reducing storage bloat," was originally published at InfoWorld.com. Follow the latest developments in information management at InfoWorld.com.



Copyright © 2010 IDG Communications, Inc.
