Lose Unwanted Gigabytes Overnight!

Data de-duplication technology eliminates redundant versions of the same information, yielding a dramatic reduction in backup data.

Like overstuffed closets, cluttered enterprise backup operations scream for attention. Fortunately, vendors are coming out with data de-duplication functions — packed into storage software suites or in stand-alone appliances — that sort through data destined for the archives and eliminate the redundancies.

Analysts say the technology can provide a 20-to-1 reduction of backup data. In other words, 20TB of original data can be shrunk to 1TB for backup purposes.

Eliminating duplicate data seems like a no-brainer, but in the past, corporations were leery of losing data on its way to backup repositories. Only now are they getting comfortable with the reliability of de-duplication technology, which has matured thanks to advancements in data transfer techniques and standards. Specifically, the rise of Advanced Technology Attachment and Serial ATA technologies, along with huge gains in processing power, has fostered better de-duplication functionality.

Suddenly, de-duplication is catching on big time, attracting big-name vendors such as EMC Corp. and Symantec Corp. In November, EMC acquired de-duplication vendor Avamar Technologies Inc., and now EMC is incorporating de-duplication into its Clariion, Centera and NetWorker product lines. Meanwhile, Symantec is reportedly scrambling to inject de-duplication capability into its Veritas NetBackup storage management software.

The premise behind de-duplication is as simple as it sounds. “Imagine having a Word document that was several megabytes in size. If you e-mailed that to a colleague who then added one word to that document, some [systems] would determine that this was a new document that needed to be backed up again,” says Jason Paige, information systems manager at Integral Capital Partners, an investment firm in Menlo Park, Calif.

To make sure files such as Word documents with minor tweaks aren’t stored several times over, Integral Capital Partners uses Avamar’s de-duplication technology.
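The mechanics behind catching such near-duplicates can be sketched in a few lines of code. The example below is illustrative only, not Avamar’s actual algorithm: it splits each version of a file into fixed-size chunks, fingerprints every chunk with a hash and treats any chunk whose fingerprint has already been seen as backed up. The chunk size, the hash and the sample data are all assumptions.

```python
# Illustrative only: fixed-size chunking with SHA-256 fingerprints.
# Real products use their own chunking and indexing schemes.
import hashlib

CHUNK_SIZE = 4096  # hypothetical chunk size in bytes


def fingerprints(data: bytes) -> set[str]:
    """Return the SHA-256 fingerprints of a file's fixed-size chunks."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    }


# A multimegabyte "document" and a copy with one word appended to the end.
version_one = b"\x00".join(str(i).encode() for i in range(500_000))
version_two = version_one + b" revised"

old_chunks = fingerprints(version_one)
new_chunks = fingerprints(version_two) - old_chunks
print(f"chunks in the original file: {len(old_chunks)}")
print(f"chunks that need backing up after the edit: {len(new_chunks)}")
```

A file-level backup tool would copy the whole document again; counted in chunks, only the changed tail has to travel. Commercial implementations typically use variable-size, content-defined chunking rather than the fixed-size chunks shown here, so that an insertion in the middle of a file does not shift every later chunk boundary.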

Corporate IT’s comfort level with the technology has increased to the point where some IT executives wonder whether de-duplication could extend from backup operations to disaster recovery and even primary storage. But first there are lingering questions about where best to insert de-duplication functionality in the backup process: at the client, at the disk or at the virtual tape library (VTL).

IT managers will have to ask vendors hard questions, because de-duplication methods vary significantly by vendor. “There is still a lot of confusion in the market about what data de-duplication is and isn’t — and where it is best done. This confusion can delay adoption,” says Heidi Biggar, an analyst at Enterprise Strategy Group Inc. in Milford, Mass.

But whatever confusion exists, corporate IT shops shouldn’t be stumped for too long. “There are pros and cons to each approach, but all have potentially significant benefits for users by allowing them to reduce the amount of [storage] capacity they need on the back end,” Biggar says. The benefits extend to other areas, too. For example, de-duplication can reduce the network bandwidth required for long-distance data replication, she says.

Where to De-dupe

Data de-duplication can take place either at the source or at the point where data is being written to disk systems or VTLs. “The packaging of this functionality can occur in three ways: as software, which can be stand-alone or integrated with the backup software; as a disk gateway or disk array; and, lastly, as a VTL,” explains David Russell, an analyst at Gartner Inc.

Avamar and Toronto-based Asigra Inc. take the first route by performing de-duplication in backup and recovery software running on a protected server — before sending the data across the network to backup repositories.

Paige, an Avamar user, explains the process this way: “Scheduled jobs start at predetermined times of the day to ‘snap up’ the data. During this snap-up, the client software compares the data that is located on that client with the data that resides on the server — and only transfers the new or changed data to the server. This allows us to transfer large virtual data sets using very little bandwidth.”
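Stripped to its essentials, that snap-up exchange looks something like the toy model below, which is an assumption for illustration rather than Avamar’s protocol. The client fingerprints its chunks, asks the server which fingerprints it lacks and ships only those chunks; the fingerprints themselves are a few dozen bytes apiece, which is where the bandwidth savings come from.

```python
# Toy model of source-side de-duplication; the chunk size, the hashing and
# the client/server exchange are all simplified assumptions.
import hashlib

CHUNK_SIZE = 4096


def split(data: bytes) -> dict[str, bytes]:
    """Map each fixed-size chunk's SHA-256 fingerprint to the chunk itself."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest(): data[i:i + CHUNK_SIZE]
        for i in range(0, len(data), CHUNK_SIZE)
    }


class BackupServer:
    """Stand-in for the remote repository; it only ever receives unseen chunks."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def missing(self, candidates: list[str]) -> list[str]:
        return [fp for fp in candidates if fp not in self.chunks]

    def receive(self, new_chunks: dict[str, bytes]) -> None:
        self.chunks.update(new_chunks)


def snap_up(data: bytes, server: BackupServer) -> int:
    """Back up one client data set; return how many chunk bytes crossed the 'network'."""
    local = split(data)
    needed = server.missing(list(local))               # fingerprints only: tiny
    server.receive({fp: local[fp] for fp in needed})   # chunk data: only what's new
    return sum(len(local[fp]) for fp in needed)


server = BackupServer()
day_one = b"".join(bytes([i]) * CHUNK_SIZE for i in range(24))  # 24 distinct chunks
day_two = day_one + b"a handful of new records"                 # same data plus a small append
print(snap_up(day_one, server))  # first run: every chunk crosses the wire
print(snap_up(day_two, server))  # second run: only the small new chunk does
```

This is also why Biggar points to replication bandwidth as a side benefit: once the target already holds a chunk, it does not have to cross the wire again.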

Some vendors take a second approach, relying on in-line, disk-based products that expunge duplicate data after it’s shipped to a disk repository. Vendors using this method include Data Domain Inc., Diligent Technologies Corp. and ExaGrid Systems Inc.

Michael Bailess, network administrator at American National Bank and Trust Co. in Danville, Va., is an ExaGrid appliance customer. “Our [system] takes data from the staging area to the repository. Only those files with changes are then stored in the repository,” Bailess says. The result is a “huge reduction in the amount of files we store,” he says.

“The way ExaGrid handles data reduction meant that we were able to purchase a much smaller [storage] system than we would have with other types of products,” Bailess says, adding that the ExaGrid product cost his company less than $20,000.
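The details vary by vendor, and some of these products work from a staging area rather than strictly in line, but the bookkeeping Bailess describes boils down to something like the hypothetical sketch below: each incoming backup stream is split into blocks, blocks already held on disk are recorded by reference instead of being stored again, and a per-backup manifest keeps enough information to reassemble the original stream. None of this reflects any particular product’s actual design.

```python
# Simplified sketch of a de-duplicating disk target; the chunk size, hashing
# and data layout are assumptions for illustration.
import hashlib

CHUNK_SIZE = 4096


class DedupTarget:
    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}         # each unique chunk kept once
        self.manifests: dict[str, list[str]] = {}  # backup name -> ordered fingerprints

    def write(self, name: str, stream: bytes) -> None:
        """Ingest a backup stream, physically storing only chunks not already on disk."""
        refs = []
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if the chunk is already stored
            refs.append(digest)
        self.manifests[name] = refs

    def read(self, name: str) -> bytes:
        """Reassemble a backup from its manifest of chunk references."""
        return b"".join(self.chunks[fp] for fp in self.manifests[name])

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


target = DedupTarget()
monday = b"".join(f"record {i:08d}\n".encode() * 8 for i in range(20_000))  # ~2.5MB
tuesday = monday.replace(b"record 00005000\n", b"REVISED 0005000\n")        # edits one record
target.write("monday.bak", monday)
target.write("tuesday.bak", tuesday)
assert target.read("tuesday.bak") == tuesday  # a full restore is still possible
print(f"logical bytes received: {len(monday) + len(tuesday):,}")
print(f"physical bytes stored:  {target.stored_bytes():,}")
```

Because repeated blocks are stored only once, the physical footprint grows far more slowly than the amount of data backed up, which is why Bailess could buy a much smaller system than the raw data would suggest.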

Data Domain, which uses an approach similar to ExaGrid’s, says its appliance can spread the benefits of de-duplication to geographically dispersed sites. That was a key selling point for Troutman Sanders LLP, an Atlanta-based law firm.

“We have 15 offices but were able to quickly get de-duplication services down pat, since the device is shipped to the location and replication is done locally,” says IT manager John Thomas.

The de-duplication investment really paid off, says Thomas. “We have been able to take 165TB of data and store 4TB. We are also able to produce instantaneous responses for users looking for archived data without having to deal with our tape-handling system,” he says.

The third approach to data de-duplication is employed by vendors such as FalconStor Software Inc., Quantum Corp. and Sepaton Inc. These vendors offer data de-duplication as extensions of their VTL systems and perform the task outside of the backup process.

The systems in this category write all data to the VTL and then run a de-duplication process after the fact. This method ensures that de-duplication won’t interfere with backup jobs and has zero impact on backup windows, the vendors claim.
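The difference from the disk-target sketch above is one of timing. In the rough outline below (a hypothetical illustration, not any particular VTL’s implementation), backups are first written to the library in full, so ingest speed is untouched; a later pass then fingerprints the stored blocks, collapses duplicates to a single copy and reports how much capacity it reclaims.

```python
# Hypothetical post-process pass over backups that were already written in full.
import hashlib

BLOCK_SIZE = 4096


def dedupe_pass(raw_backups: dict[str, bytes]):
    """Walk raw backup images and collapse repeated blocks into a shared store."""
    block_store: dict[str, bytes] = {}    # one physical copy per unique block
    manifests: dict[str, list[str]] = {}  # backup name -> ordered block references
    for name, image in raw_backups.items():
        refs = []
        for i in range(0, len(image), BLOCK_SIZE):
            block = image[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            block_store.setdefault(digest, block)
            refs.append(digest)
        manifests[name] = refs
    return block_store, manifests


# Two nightly backups land on the "VTL" in full before de-duplication runs.
night_one = b"".join(f"log line {i:07d}\n".encode() * 4 for i in range(30_000))
night_two = night_one + b"a few new log lines\n"
raw = {"night_one.img": night_one, "night_two.img": night_two}

before = sum(len(image) for image in raw.values())
block_store, _manifests = dedupe_pass(raw)  # the manifests would drive later restores
after = sum(len(block) for block in block_store.values())
print(f"capacity consumed before the pass: {before:,} bytes")
print(f"capacity consumed after the pass:  {after:,} bytes")
```

The before-and-after figures make the trade-off Biggar describes below concrete: the full backup images occupy disk until the pass runs, and that capacity is freed only afterward.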

The Trade-offs

It all sounds easy, but of course there are complications and trade-offs. For example, regardless of the chosen de-duping method, “performance degradation can be an issue,” says Enterprise Strategy Group’s Biggar.

Specifically, companies performing de-duplication at the source run the risk that this function will get in the way of the primary task at hand — protecting data as it is readied for offloading to backup storage systems. “The potential disadvantage here is that de-duplication can steal memory cycles from the backup servers,” Gartner’s Russell says.

After-the-fact de-duplication poses challenges, too. More upfront capacity is required to store data that will be de-duped in postprocessing, says Biggar. “However, capacity is released after the de-duplication is complete,” she notes.

But Biggar is quick to add that any trade-offs pale in comparison to the benefits of de-duplication. Her conclusion: “ESG Labs has tested several vendors’ de-duplication technologies and has had no issue implementing or using the technologies, and we have substantiated vendors’ data reduction claims. In general, the benefits of data de-duplication far outweigh any negatives.”

McAdams is a freelance writer in Vienna, Va. Contact her at JMTechWriter@aol.com.

Copyright © 2007 IDG Communications, Inc.
