Demystifying de-duplication

Source de-duplication is set to be "more disruptive" than previous technologies

Other products can handle the backup stream and de-dupe in band, in real time. Target vendors that de-dupe in band include Data Domain and Diligent. All source vendors de-dupe in band as well. Paradoxically, the in-band vendors are able to sustain full backup stream performance.

The unit of de-dupe granularity, called chunk size, further differentiates de-dupe products. Only NetApp touts a fixed chunk size equal to data block size, according to Schulz. Most de-dupe products claim variable chunk size, from the file level down to the sub-block level. With variable chunk size, when data is inserted into a file, only the changed data shows up as different, and the rest of the file still matches what is already stored. Even more impressively, most de-dupe products can not only reduce data across generations of the same file but also eliminate copies of data across files.
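To illustrate why variable chunk size copes with inserted data, here is a minimal Python sketch of content-defined chunking. It is not any vendor's algorithm: the window size, boundary mask, CRC-based boundary test and the helper names (`fixed_chunks`, `variable_chunks`, `new_bytes`) are all assumptions chosen for illustration.

```python
import hashlib
import zlib

WINDOW = 16       # bytes examined for each boundary decision (assumed value)
MASK = 0x3F       # cut when the low 6 bits of the window checksum are zero (assumed)
FIXED_SIZE = 64   # block size for the fixed-chunk comparison (assumed)


def fixed_chunks(data, size=FIXED_SIZE):
    """Cut data into equal-sized blocks, the way a fixed-chunk product would."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data):
    """Cut data wherever the last WINDOW bytes of the current chunk hash to a boundary."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < WINDOW:
            continue  # chunk is still shorter than the window
        if zlib.crc32(data[end - WINDOW:end]) & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def new_bytes(stored_chunks, incoming_chunks):
    """Count bytes in incoming chunks whose fingerprints are not already stored."""
    known = {hashlib.sha256(c).digest() for c in stored_chunks}
    return sum(len(c) for c in incoming_chunks
               if hashlib.sha256(c).digest() not in known)


# Deterministic pseudo-random "file", then the same file with 5 bytes inserted.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(256))
edited = original[:1000] + b"EDIT!" + original[1000:]

print("fixed chunks, new bytes:   ",
      new_bytes(fixed_chunks(original), fixed_chunks(edited)))
print("variable chunks, new bytes:",
      new_bytes(variable_chunks(original), variable_chunks(edited)))
```

Because the insertion shifts every later byte, the fixed-size blocks after the edit no longer line up with the stored ones, whereas the content-defined boundaries typically realign within a chunk or two, so far less data has to be stored again.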

Yet another de-dupe difference is file-type sensitivity. Some target products open the backup stream to determine data type and invoke file-type-specific policies to provide better de-duplication. Sepaton claims to have the most file-specific policies. Other target vendors claim that their products perform file-specific de-dupe to a lesser extent, while Data Domain, Diligent and Quantum all claim to be completely agnostic to the backup stream.

Reported de-dupe compression ratios range anywhere from 3:1 to 500:1. De-dupe products can sustain these high ratios because backups generate duplicate data every time a full backup is run. Moreover, beneath the file level, most of the data in a modified file is unchanged. But some data does not de-dupe well, including audio, photo, movie and other media files, which simply don't have excess white space or duplicate data.
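A rough way to see how repeated full backups drive such ratios is the toy model below. It is only a sketch under stated assumptions: every full backup is the same size, a fixed fraction of each new full is data never seen before, and the function name `dedupe_ratio` and the sample change rates are hypothetical, not vendor figures.

```python
def dedupe_ratio(full_backups, change_rate):
    """Logical data protected divided by unique data stored.

    Assumes every full backup is the same size and that a fixed fraction
    (change_rate) of each new full consists of chunks never seen before.
    """
    if full_backups < 1 or not 0.0 <= change_rate <= 1.0:
        raise ValueError("need at least one backup and a change rate in [0, 1]")
    logical = full_backups                          # in units of one full backup
    stored = 1 + (full_backups - 1) * change_rate   # first full plus later changes
    return logical / stored


for fulls in (1, 10, 52):
    for rate in (0.01, 0.10, 1.0):   # 1.0 means nothing repeats between fulls
        print(f"{fulls:3d} fulls, {rate:4.0%} change: {dedupe_ratio(fulls, rate):5.1f}:1")
```

In this model the ratio grows with the number of retained fulls and shrinks as the change rate rises, which is one reason quoted ratios vary so widely from site to site.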

Troutman Sanders' Thomas manages 18 Data Domain boxes throughout the U.S. They are being used as NAS targets for Symantec Backup Exec. The Data Domain boxes range in capacity from 1TB to 15TB. But each of Troutman Sanders' NAS boxes backs up the equivalent of 400TB to 500TB of data.

The Data Domain arrays have been in production for more than eight months. Each office does a local backup to its on-site Data Domain box. That data is replicated off-site to headquarters in Atlanta and then replicated again to the law firm's secondary site. Before Data Domain, Thomas says, his company had DLT-4 tape changers and LTO-1 tape. Backups at some offices used to take more than 11 hours, but now with Data Domain, the backups take under 50 minutes. Troutman Sanders backs up one site with 163TB of data onto 4.6TB of Data Domain for a compression ratio of 34.6:1. Smaller sites have experienced a compression ratio as high as 55:1.
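As a quick check on figures like these, the ratio is simply data backed up divided by disk consumed. The helper name `dedupe_summary` is hypothetical, and the inputs are the rounded sizes quoted above, so the result (roughly 35:1) lands close to, but not exactly on, the reported 34.6:1.

```python
def dedupe_summary(logical_tb, stored_tb):
    """Return (ratio, percent of space saved) for data backed up vs. disk consumed."""
    ratio = logical_tb / stored_tb
    saved_pct = (1 - stored_tb / logical_tb) * 100
    return ratio, saved_pct


# Rounded figures quoted for the one Troutman Sanders site.
ratio, saved_pct = dedupe_summary(163, 4.6)
print(f"{ratio:.1f}:1, {saved_pct:.1f}% less disk than the raw data")
```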

"The [Data Domain] hardware was comparable to buying all-new tape hardware, but the speed and the ability to not have to manage the tape is what really got us … the replication is an added bonus," Thomas says.

Grahame McKenzie, manager of IT at Crawford Adjusters Canada Inc. in Hamilton, Ontario, previously performed local backups at each of the company's 80 remote sites. He now manages 80 remote and 20 local Asigra clients backing up to a single 2TB Asigra TeleVault. His team mirrors the central TeleVault to a hardened off-site data center and copies the monthly full backups to tape, which is then moved off-site to another location.

McKenzie says that before, "we weren't even doing it [backups] on the smaller XP share points -- there was no backup. It just wasn't economical; if it was an NT server, it would [be] backed up to tape." This process was time-consuming and not subject to easy validation. He says that "with Asigra, backups all take place, and they all come here, … and I get a report on everything the next day." McKenzie says that in some cases, it was difficult to quantify a return on his investment because no backups were being performed, but he did say "we don't invest in tape on our new branch servers, and we have one tape library here."

Schulz says data de-duplication allows users to retire most of their tape infrastructure at remote and local sites. In some environments, it may not be possible to eliminate all tape processing, but it can be reduced considerably. At essentially the purchase price of replacing tape, you can get all the benefits of tape with the convenience of disk -- mainly quicker, less error-prone and less operator-intensive backups and restores.

De-dupe product comparison table
| Products | Purchase | Inline or offline | Chunk size | Personality | File type special processing | De-duplication performed |
|---|---|---|---|---|---|---|
| Asigra | Bundled appliance | Inline | Variable | N/A | N/A | Source |
| Data Domain | Bundled appliance | Inline | Variable | NAS & VTL | None | Target |
| Diligent | Software license | Inline | Variable | VTL | None | Target |
| ExaGrid | Bundled appliance | Offline | Variable | NAS | CA’s ArcServe, EMC’s Networker, Symantec’s NetBackup, Symantec’s Backup Exec, CommVault’s Galaxy, IBM’s TSM | Target |
| EMC Avamar | Software license | Inline | Variable | N/A | N/A | Source |
| FalconStor | Bundled appliance | Offline | Variable | VTL | EMC’s Networker, Symantec’s NetBackup, IBM’s TSM | Target |
| Network Appliance SnapVault | Bundled appliance | Inline | Fixed | N/A | N/A | Source |
| Quantum DXi | Bundled appliance | Offline | Variable | NAS & VTL | None | Target |
| Symantec NetBackup PureDisk | Software license | Inline | Variable | N/A | N/A | Source |
| Sepaton | Bundled appliance | Offline | Variable | VTL | Symantec’s NetBackup, IBM’s TSM, HP’s Data Protector | Target |

Copyright © 2007 IDG Communications, Inc.
