Data deduplication in the cloud explained, part one

Everyone is talking about the benefits of storing data in the cloud: sharing information among friends, simplifying the movement of data between mobile devices, and giving small businesses backup and disaster recovery (DR) capabilities. But what about the massive amounts of data in enterprise data centers? How do cloud providers protect your data? How is the entire Internet protected?

Let’s face it, backing up the data from your cellphone to the cloud is fairly routine. The hard job is on the back end, where service providers and large companies need to move, protect and store the massive amounts of data they have within and between their data centers.

If you intend to move large amounts of data over a network and provide access to that data as a service, you need to be cognizant of network bandwidth requirements, data security and the total IT costs of providing those services to end users, especially when providing services for data storage and DR protection.

For example, a basic requirement for any cloud-based data protection solution is that it cost less than the same service clients could run themselves in their own data centers. One method used to achieve this goal is data deduplication across multiple end-user clients, where the cost of providing the service is amortized over the number of paying clients. There are multiple methods to deduplicate data, so service providers and their customers need to understand the differences between the available solutions and the impact they may have on security and on the ability to move, protect and store data in the cloud cost-effectively.

Over the course of the next few blog posts, I will examine the benefits of deduplication, along with the computing methods behind the technology and the implementation types.

Understanding the concepts of data deduplication

Data deduplication is one of the hottest technologies in storage right now because it lets companies save money both on the storage needed to hold their data and on the bandwidth needed to replicate it offsite for DR. This is great news for cloud providers: if you store less, you need less hardware. Deduplicating what you store also lets you use your existing storage space more efficiently. If you store less, you also back up less, which again means less hardware and backup media. And if you store less, you send less data over the network in case of a disaster, which saves on hardware and network costs over time. The business benefits of data deduplication include:

  • Reduced hardware costs;
  • Reduced backup costs;
  • Reduced costs for business continuity / disaster recovery;
  • Increased storage efficiency; and
  • Increased network efficiency.

How deduplication works: Store only unique objects in a database of objects

In simplified terms, data deduplication compares objects (usually files or blocks) and removes copies of objects that already exist in the data set, so that only unique objects remain. The process consists of four steps:

  1. Divide the input data into blocks or “chunks.”
  2. Calculate a hash value for each block of data.
  3. Use these values to determine if another block of the same data has already been stored.
  4. Replace the duplicate data with a reference to the object already in the database.

Once the data is chunked and hashed, an index can be built from the hash values, and the duplicates can be found and eliminated. Only a single instance of every chunk is stored.
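
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular product works: the fixed 4 KB block size, the choice of SHA-256, the in-memory dictionary standing in for the "database of objects," and the function names `deduplicate` and `restore` are all assumptions made for the example. It also trusts the hash completely and ignores the (very unlikely) case of two different blocks producing the same value.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipe): store maps hash -> block bytes, and recipe is the
    ordered list of hashes needed to rebuild the original input.
    """
    store = {}   # hypothetical in-memory "database of objects"
    recipe = []  # ordered references to blocks in the store
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]    # step 1: chunk the input
        digest = hashlib.sha256(block).hexdigest()  # step 2: hash the block
        if digest not in store:                     # step 3: seen before?
            store[digest] = block                   # no: store the new block
        recipe.append(digest)                       # step 4: keep a reference
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data by following the references in order."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Four 4 KB blocks, three of which are identical.
    data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
    store, recipe = deduplicate(data)
    print(len(recipe), "blocks referenced,", len(store), "unique blocks stored")
    assert restore(store, recipe) == data
```

In this toy run the input is made of four blocks but only two are stored; the other two are replaced by references to a block already in the store.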


The actual process of data deduplication can be implemented in a number of different ways. You can eliminate duplicate data by simply comparing two files and making the decision to delete one that is older or no longer needed.
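
As a rough sketch of that simplest approach, the snippet below groups whole files by the hash of their contents so that identical copies can be spotted. The helper names `file_digest` and `find_duplicate_files` are made up for this example, and SHA-256 is just one reasonable choice of hash. The code only reports duplicates; deciding which copy is older or no longer needed, and deleting it, is deliberately left to the reader.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, buffer_size: int = 65536) -> str:
    """Hash a file's contents without loading the whole file into memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(buffer_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicate_files(paths):
    """Return a list of groups, each group holding paths with identical contents."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(Path(path))
    # Only groups with more than one member contain duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Calling `find_duplicate_files` with a list of file paths returns one group per set of identical files; files that have no twin are left out of the result.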

But real deduplication solutions use more sophisticated methods, and the math involved can make your head spin. If you really want to understand all the nuances of the mathematical techniques used to find duplicate data, you should probably take college courses in statistical analysis, data security and cryptography. And hey, who knows: if your current line of work doesn’t pan out, maybe you could even get a job at the NSA! In my next blog post, I will dive deeper into the compute and implementation methods of deduplication.

Chris Poelker on Data Deduplication

1. Basic concepts

2. Deep Dive

3. Implementation methods

Copyright © 2013 IDG Communications, Inc.
