Data deduplication in the cloud explained, part two: The deep dive

In part one of this series, I covered the basic concepts of data deduplication. Before getting to the next installment, I want to take a second to apologize to readers for the long delay between posts. It's a long story, but the good news is I am back and ready to go.

Here in part two, I dig a bit deeper into the guts of how deduplication works and why it's important in ensuring that our personal and business data is continually and efficiently protected. This is a concern for almost everyone who backs up data to the cloud or shares information with friends and family over the network. For some businesses, a cloud service provider might be the preferred way to back up files in case of a disaster or hardware failure. This is why many service providers and enterprises rely on deduplication to keep storage costs in check. Expanding upon my last blog on the business benefits of deduplication, let's dive into the details of how this technology works by looking at the different methods vendors are using within their dedupe offerings.

The most common methods of implementing deduplication are:

  • File-based compare
  • File-based versioning
  • File-based hashing
  • Block or sub-block versioning
  • Block or sub-block hashing


File-based compare

File system-based deduplication is a simple method to reduce duplicate data at the file level, and usually is just a compare operation within the file system or a file system-based algorithm that eliminates duplicates. An example of this method is comparing the name, size, type and date-modified information of two files stored in a system. If these parameters match, you can be pretty sure that the files are copies of each other and that you can delete one of them with no problems. Although this isn't a foolproof method of data deduplication, it can be done with any operating system and can be scripted to automate the process, and best of all, it's free. In a typical enterprise environment running the usual applications, you could probably squeeze out 10 percent to 20 percent better storage utilization just by getting rid of duplicate files.


Example: File1.txt and File2.txt are the same size and have the same creation time. Most likely, one is a duplicate.
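
To make the compare idea concrete, here is a minimal sketch of the kind of script I have in mind. It assumes that matching file type (extension), size and modification time is good enough to flag a likely duplicate. The function name find_metadata_duplicates is my own invention for illustration, and the script only reports candidates rather than deleting anything.

```python
import os
from collections import defaultdict


def find_metadata_duplicates(root):
    """Group files under `root` whose type, size and modified time match.

    This is a heuristic only: matching metadata suggests, but does not
    prove, that two files have identical contents.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Key on extension, size in bytes, and whole-second mtime.
            ext = os.path.splitext(name)[1].lower()
            key = (ext, info.st_size, int(info.st_mtime))
            groups[key].append(path)
    # Keep only the keys shared by more than one file.
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Each list that comes back is a set of likely duplicates. Because the check never looks inside the files, it is worth reviewing the candidates before removing anything.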

File-based delta versioning and hashing

More intelligent file-level deduplication methods actually look inside individual files and compare differences within the files themselves, or compare updates to a file and then just store the differences as a "delta" to the original file. File versioning associates updates with a file and just stores the deltas as other versions. File-based hashing creates a mathematical "hash" representation of each file, and then compares the hashes of new files to those of files already stored. If there is a hash match, you can be confident, to a very high statistical probability, that the files are the same, and one can be removed.
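
Here is a rough sketch of file-based hashing using Python's standard hashlib module. SHA-256 is my choice for the example, not something any particular vendor is known to use, and the helper names file_digest and group_by_content are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in pieces so large files never need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_content(paths):
    """Map each digest to the paths that share it (duplicates only)."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}
```

Unlike the metadata compare, this reads every byte of every file, so it costs more I/O but does not depend on names or timestamps.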

A lot of backup applications have the versioning capability, and you may have heard it called incremental or differential backup. Some backup software options (Tivoli Storage Manager is a good example) always use the versioning method to speed backup. You do a full backup the first time, and from then on, only the changes in the data need to be stored. IBM calls this "progressive" backup.
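
The full-then-changes-only pattern can be sketched in a few lines. This is a simplified illustration and not how Tivoli Storage Manager or any other product actually works: it assumes an in-memory catalog of digests, it sends whole changed files rather than deltas within a file, and plan_backup is a name I made up.

```python
import hashlib


def plan_backup(catalog, files):
    """Return the names that must be sent in this backup run.

    catalog: dict of name -> digest recorded by earlier runs (updated in place)
    files:   dict of name -> current contents as bytes
    """
    to_send = []
    for name, data in sorted(files.items()):
        digest = hashlib.sha256(data).hexdigest()
        if catalog.get(name) != digest:
            to_send.append(name)      # new, or changed since the last run
            catalog[name] = digest    # remember what was backed up
    return to_send
```

With an empty catalog, the first call returns every file, which is the full backup. Later calls return only the files whose contents changed.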

Other software solutions use similar techniques to reduce wide area network (WAN) requirements for centralized backup. Intelligent software agents running on the client (desktop, laptop or workstation) use file-level versioning or hashing at the client to send only delta differences to a central site. Some solutions actually send all data updates to the central site and then hash the data once it arrives, storing only the unique data elements.

Most products that use a "hashing" mechanism also require an index to store the hashes so that they can be looked up quickly and compared against new hashes. The lookup shows whether the new data is unique (i.e., not already stored) or whether there is a hash match and the new data element does not need to be stored. These indexes must be very fast, or be handled in such a manner that the solution doesn't slow down during the hash lookup and compare process as the amount of unique data stored increases and becomes fragmented.

Different solutions from various vendors use diverse hashing algorithms, but the process is basically the same. The term "hashing the data" means "creating a short, fixed-size mathematical fingerprint of a specific dataset that is, statistically, all but guaranteed to differ from the fingerprint of any other dataset." This is done by running each dataset through a generally understood and approved hash function. A hash is not encryption, and the original data cannot be reproduced from the hash alone. Instead, the resulting hash is used as a lookup key within the index: if the hash of new data matches a stored hash, the new data can be ignored and replaced with a reference to the copy already stored.
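
For readers who have never seen a hash, this short sketch shows what one looks like in practice. SHA-256 from Python's standard library is my pick for the illustration. Real products may well use other algorithms.

```python
import hashlib

original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy cog"

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(altered).hexdigest()
digest_c = hashlib.sha256(bytes(original)).hexdigest()  # a second copy

# The digest length is fixed no matter how large the input is.
print(len(digest_a))
# Identical data always produces an identical digest.
print(digest_a == digest_c)
# A one-letter change produces a different digest.
print(digest_a == digest_b)
```

The digest works as a compact lookup key for the data. It is a fingerprint of the contents, not a compressed or encrypted copy of them.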

Block delta versioning and hashing

Block-based solutions work based on the way data is actually stored on disk and do not need to know anything about the files themselves, or even the operating system being used. Block delta versioning and hashing solutions can be used on files (unstructured data) and databases (structured data). Block delta versioning works by monitoring updates on disk at the block level and storing only the data that changed in relation to the original data block. Block-level delta versioning is how snapshots work: each snapshot contains only the changes to the original data.

Block-level delta versioning can also be used as a method to reduce data replication requirements for disaster recovery (DR) purposes. Let's say your company wants to keep the remote data up to date every six hours, so you have to replicate changes every six hours to the DR location. If a block of data on disk at the local site is updated hundreds of times during the time delta between the last replication and the new one, but the replication solution uses block delta versioning, only the last update to the block needs to be sent, which can greatly reduce the amount of data traveling from the local site to the DR site.
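
The replication scenario above can be sketched as a simple dirty-block tracker. This is a toy in-memory model under my own assumptions (the class name DirtyBlockTracker and the 4,096-byte block size are illustrative), not a description of any shipping replication product.

```python
class DirtyBlockTracker:
    """Remember only the latest contents of each block changed this cycle."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.dirty = {}   # block number -> most recent contents
        self.writes = 0   # total writes seen, for comparison

    def write(self, block_no, data):
        # A later write to the same block replaces the earlier one.
        self.dirty[block_no] = data
        self.writes += 1

    def replicate(self):
        """Return the blocks to send to the DR site and start a new cycle."""
        pending, self.dirty = self.dirty, {}
        return pending
```

If block 7 is written 300 times between replication cycles, replicate() still returns a single entry for block 7 holding the final contents, so one block crosses the wire instead of 300.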

Block-level hashing works similarly to file-level hashing, except in this case, every block or chunk of data stored on the disk is mathematically hashed and the hashes are indexed. Every new block of data being stored is also hashed, and its hash is looked up in the index. If the new data hash matches a hash for a block already stored, the new data does not get stored, thus eliminating duplicates.
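
A minimal sketch of block-level hashing follows. It assumes fixed 4,096-byte blocks, SHA-256 digests and a plain in-memory dictionary standing in for the index. The BlockStore class is hypothetical, and a real product would need a persistent, much faster index.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


class BlockStore:
    """Store each unique block once, keyed by its digest."""

    def __init__(self):
        self.blocks = {}  # digest -> block contents (the "index")

    def write(self, data):
        """Store `data` and return its recipe: an ordered list of digests."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Only blocks with a digest not seen before take up space.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by looking up every digest in order."""
        return b"".join(self.blocks[d] for d in recipe)
```

Writing three identical blocks followed by one different block produces a four-entry recipe but stores only two blocks. Note that read() has to walk the recipe and fetch every block before the data is usable again.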




Sub-block delta versioning and hashing

Sub-block-level delta versioning and hashing methods work exactly the same as the block methods, except at a more granular level. Sub-block delta versioning works at the byte level and can be many times more efficient in reducing duplicate data than block level. For example, open systems servers running Windows, Unix and Linux format disks into sectors of 512 bytes each, which makes the sector a natural sub-block unit. The smaller you chunk the data, the more probable it is that you will find a duplicate, but as smaller chunks are used, more hashes are required. There is usually a tradeoff between the deduplication ratio and the size, and therefore the speed, of the hash index.
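
To put rough numbers on that tradeoff, the sketch below estimates the worst-case index size, where every chunk is unique. It assumes 32-byte (SHA-256) digests and counts only the digests themselves, ignoring pointers and other per-entry overhead, so treat the output as a lower bound.

```python
DIGEST_BYTES = 32  # SHA-256 digest size; an assumption for illustration
TIB = 1 << 40      # one tebibyte of data to protect


def index_entries(data_bytes, chunk_bytes):
    """Worst-case number of index entries: every chunk is unique."""
    return (data_bytes + chunk_bytes - 1) // chunk_bytes  # round up


def index_bytes(data_bytes, chunk_bytes):
    """Space taken by the digests alone for that many entries."""
    return index_entries(data_bytes, chunk_bytes) * DIGEST_BYTES


for chunk in (32768, 4096, 512):
    entries = index_entries(TIB, chunk)
    gib = index_bytes(TIB, chunk) / (1 << 30)
    print(f"{chunk:>6}-byte chunks: {entries:>13,} entries, {gib:g} GiB of digests")
```

For 1 TiB of unique data, this works out to about 1 GiB of digests at 32K chunks, 8 GiB at 4 KB chunks and 64 GiB at 512-byte chunks. That growth is why smaller chunks put so much more pressure on the index.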


Sub-block versioning

A block of data on a Windows server usually takes up eight 512-byte sectors for each 4 KB block of data being stored. Since one of the smallest updates to a disk usually occurs at the sector level, if only one sector is updated, then why mark the entire block as updated? For a single-sector update, a sub-block delta versioning solution that monitors updates at the sector level is eight times more efficient than one that simply tracks block updates, and it can be up to 64 times more efficient than other replication solutions that sometimes use 32K tracks as the smallest monitored update.


An update to a single 512-byte sector on a solution that tracks updates at the 32K track level would need to send the entire track. That's a lot of "white space" and can be an inefficient use of network bandwidth and storage. Sub-block-level delta versioning is also known as "micro-scanning" in the industry.
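
The arithmetic behind those ratios is easy to check. This sketch uses the sector, block and track sizes from the text. The function name bytes_marked_dirty is just a label I chose.

```python
SECTOR = 512         # bytes per sector
BLOCK = 8 * SECTOR   # 4 KB block made of eight sectors
TRACK = 32 * 1024    # 32K track


def bytes_marked_dirty(offset, length, granularity):
    """Bytes that must be sent when changes are tracked in fixed-size units."""
    first = offset // granularity
    last = (offset + length - 1) // granularity
    return (last - first + 1) * granularity


# A single-sector update at the start of the disk:
for name, unit in (("sector", SECTOR), ("block", BLOCK), ("track", TRACK)):
    print(name, bytes_marked_dirty(0, SECTOR, unit))
```

The same 512-byte change costs 512 bytes at sector granularity, 4,096 bytes at block granularity and 32,768 bytes at track granularity, which is where the 8x and 64x figures come from.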




To hash or not to hash

Hashing-based dedupe solutions typically provide great results in reducing storage requirements for a particular data set, but they have one huge disadvantage compared with delta versioning. Since everything is stored as a jumble of mathematical hashes, objects and indexes, the data must be "re-constituted" before applications can use it again. This re-constitution process takes time, which may have a negative impact if the data needs to be recovered NOW. Micro-scanning solutions have a slightly lower overall deduplication ratio for a particular dataset, but the data is always in the native format of the application and is always immediately available for use. This is important when quick application recovery is the goal. Another benefit of micro-scanning is the ability to restore only the sectors required to recover any lost or corrupted data, so massive databases like data warehouses can sometimes be recovered over the network almost instantly.

In the final part of this deduplication series, I will examine the various implementation methods of data deduplication.

See also:

Chris Poelker on Data Deduplication

1. Basic concepts

2. Deep Dive

3. Implementation methods (free registration required)

This article is published as part of the IDG Contributor Network.