To reduce or dedupe, that is the big data question

Reducing data at the source is the smart way to do backup. That is the conclusion I came to in my last post, “If files were bricks, you'd change your backup strategy.” But I also left off by saying “there are technologically different ways to do this, which have their own smart and dumb aspects.” Let’s take a look at them.

There are two common ways of reducing data at the host (as I mentioned last time, I am only considering traditional backup from servers, not disk-array snapshots). Since terminology can be used in different ways, I’ll define the terms as I use them.

Data deduplication: A process that examines new data blocks using hashing, compares them to existing data blocks, and skips redundant blocks when data is transferred to the target.

Data reduction: A process that tracks block changes, usually using some kind of log or journal, and then transfers only new blocks to the backup target.


For the most efficient data backup, only changed data blocks should be moved.

The end result of each process is similar – but not the same – yet there are major differences in how things work in the real world. Let’s look at each in turn.

Deduplication is the most effective at reducing the quantity of data that gets sent if the dedupe is universal (applied across all protected servers rather than one server at a time), which it generally is for products that use it. Data reduction methods are limited to a single server because they rely on a local journal, and the journal doesn’t know that a particular block of data may have already been sent from some other node. Realistically, this benefit is mostly derived in the initial backup set, when you can avoid sending the same operating system bits over and over. After that, there are far fewer differences.
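To make the initial-backup point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the block contents, the `dedupe_send` helper, and the use of SHA-256 are my assumptions, not how any particular product works.

```python
import hashlib

def dedupe_send(blocks, seen_hashes):
    """Return only the blocks whose hash is not yet in the shared index."""
    to_send = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            to_send.append(block)
    return to_send

# Two hypothetical servers that share the same operating system blocks.
os_blocks = [b"kernel", b"libc", b"shell"]
server_a = os_blocks + [b"app-a data"]
server_b = os_blocks + [b"app-b data"]

# Universal dedupe: one hash index shared by every server.
global_index = set()
sent_a = dedupe_send(server_a, global_index)
sent_b = dedupe_send(server_b, global_index)
dedupe_total = len(sent_a) + len(sent_b)

# Journal-based reduction: on the first backup every block is "new"
# to each server's local journal, so each server sends everything.
journal_total = len(server_a) + len(server_b)

print(dedupe_total, journal_total)
```

In this toy case the second server sends only its one unique block, while the journal approach sends the shared operating system blocks twice. Once the initial set is done, both approaches send roughly just what changed.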

The largest downside of deduplication is that it uses a process that is computationally heavy. Hashing data uses significant system resources, and that affects applications and slows response times. This can be minimized by tracking which files change and only hashing those files that have been updated, but that advantage disappears with systems that have high rates of change or large database files. A single transaction is all it takes to make an entire database file “new,” meaning the whole file has to be re-hashed. In fact, some vendors will even recommend you don’t use their deduplication on databases because of the impact it creates.
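The database problem is easy to see in a small sketch. This is illustrative Python under my own assumptions (a tiny 4-byte block size, SHA-256 hashing, and a hypothetical `backup_changed_file` helper); real products use much larger blocks and their own hashing schemes.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only

def split_blocks(data, size=BLOCK_SIZE):
    """Cut a file's contents into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def backup_changed_file(data, index):
    """Hash every block of a changed file.

    Returns (blocks_hashed, blocks_sent). Only blocks whose hash is
    missing from the index are counted as sent.
    """
    hashed = 0
    sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).digest()
        hashed += 1
        if digest not in index:
            index.add(digest)
            sent += 1
    return hashed, sent

index = set()
database = bytearray(b"AAAABBBBCCCCDDDD")

# Initial backup: every block is hashed and sent.
first = backup_changed_file(bytes(database), index)

# One small "transaction" touches a single block...
database[4:8] = b"XXXX"

# ...but the file is now flagged as changed, so all of it is re-hashed.
second = backup_changed_file(bytes(database), index)

print(first, second)
```

The second pass sends only one block, which is the bandwidth win, but it still hashes all four. Scale that up to a multi-terabyte database file and you can see where the system impact comes from.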

That brings up another key thing to keep in mind about host deduplication. Vendors like to make claims about how much deduplication reduces backup time, but when you examine what they mean, they are referring only to the data transfer time. They conveniently leave out the data hashing time. If a backup takes five hours to hash the data set and then transmits the blocks in 30 minutes, vendors will say it was a “30-minute backup” when it was really a five-and-a-half-hour backup in terms of system impact. Watch out for this when you are evaluating deduplication products.
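The arithmetic is trivial, but it is worth writing down because it is exactly what gets left out of the marketing. A quick Python check using the example numbers above (the variable names are mine):

```python
from datetime import timedelta

hash_time = timedelta(hours=5)         # time spent hashing the data set
transfer_time = timedelta(minutes=30)  # time spent sending unique blocks

claimed = transfer_time                # what the "30-minute backup" counts
total = hash_time + transfer_time      # what the server actually endures

print("claimed:", claimed)
print("actual: ", total)
print("hours of system impact:", total.total_seconds() / 3600)
```

When you evaluate a product, ask for both numbers, or measure them yourself on your own servers.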

Switching to data reduction, the great benefit of this approach is that it creates very little system impact at all. The key is that data is never hashed, so there are no high impact calculations being performed. Instead, block changes are logged as they happen. In terms of system impact, this tracking process barely even registers on a server. When it’s time to run the backup, the changed blocks are already known and they can be moved very efficiently via a block-to-block copy (sometimes dedupe products will move data via the file system, creating another point of stress).
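Here is a minimal sketch of the journaling idea in Python. The `JournaledVolume` class and its methods are hypothetical names I made up for illustration; a real product tracks changes at the driver or volume level, not in application code.

```python
class JournaledVolume:
    """Toy volume that logs which block numbers change as writes happen."""

    def __init__(self, num_blocks, block_size=4):
        self.blocks = [bytes(block_size)] * num_blocks
        self.dirty = set()  # the "journal": block numbers changed since last backup

    def write(self, block_no, data):
        self.blocks[block_no] = data
        self.dirty.add(block_no)  # cheap bookkeeping, no hashing involved

    def backup(self, target):
        """Copy only the changed blocks to the target, then reset the journal."""
        sent = sorted(self.dirty)
        for block_no in sent:
            target[block_no] = self.blocks[block_no]
        self.dirty.clear()
        return sent

volume = JournaledVolume(num_blocks=8)
volume.write(2, b"aaaa")
volume.write(5, b"bbbb")
volume.write(2, b"cccc")  # rewriting a block does not add a second entry

target = {}
first_run = volume.backup(target)   # only blocks 2 and 5 are copied
second_run = volume.backup(target)  # nothing changed, nothing to copy

print(first_run, second_run)
```

Notice that the cost at write time is a single set insertion, and the backup itself never has to read or hash the unchanged blocks.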

An added benefit of data reduction: because impact is so low, you can usually run multiple backups per day without much trouble, giving you better recovery points. File size, as with large databases, makes no difference because nothing is done at the file level. With data deduplication it is close to impossible to protect a database more than once a day because the hashing process creates too much impact and takes too much time.

So which to choose? As always, it depends on your specific needs. If reducing the impact of backup processing is a major concern, then data reduction is clearly better. If minimizing bandwidth utilization is key, then deduplication may be better, though the difference will depend on application characteristics.

But don’t forget that backup is only part of the data protection challenge. You have to look at recovery scenarios as well, and there again you can find large differences between products and techniques. As a technology buyer, always do your homework, ask lots of questions, and focus like a laser on your specific server and application mix. A product may sound great “in general,” but nobody uses a product “in general.” You use it in your data center, not someone else’s data center. And if a vendor can’t explain to you precisely how their product will help in your own environment, then the only thing left to show them is the door.
