Opinion: Demystifying de-duplication
De-duplication can be applied anywhere there is a significant amount of data commonality
Computerworld - Of the assortment of technologies swarming around the storage and data-protection space these days, one that can be counted on to garner both lots of interest and lots of questions among users is de-duplication. The interest is understandable, since the potential value proposition, in terms of reduction of required storage capacity, is at least conceptually on a par with the ROI of server virtualization. The win-win proposition of providing better services (e.g., disk-based recovery) while reducing costs is undeniably attractive.
However, while the benefits are obvious, the road to get there isn't necessarily as clear. How does one decide to adopt a particular technology when that technology manifests itself in so many different forms? De-duplication, like compression before it, can be incorporated into a number of different product types. While by no means a complete list, the major options for our purposes include backup software, NAS storage devices, and virtual tape libraries.
Even within these few categories, there are dramatic differences in how de-duplication is implemented, with each offering having its own benefits. The scorecard of feature trade-offs includes the following:
- Source vs. target de-duplication
- Inline vs. postprocessing
- Global vs. local span
- Single- vs. multiple-head processing
- Indexing methodology
- Level of granularity
As with any set of products, these trade-offs reflect optimization for specific design or market targets: high performance, low cost, enterprise, SMB, etc. For more detail on the range of de-duplication options and their implications, you may want to check out my colleague Curtis Preston's Backup Central blog.
Until recently, one aspect of de-duplication that was generally unquestioned was its focus: secondary data, particularly backup. However, there are growing signs that this too is changing. In theory, de-duplication can be applied anywhere there is a significant amount of data commonality — which is why backup is such a good fit.
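To make the commonality point concrete, here is a minimal sketch of why repeated backups de-duplicate so well. It assumes fixed-size 4 KiB chunks fingerprinted with SHA-256, which is only one of many possible approaches; actual products differ in granularity and indexing methodology. The function name and the ten-night backup scenario are hypothetical, invented purely for illustration.

```python
import hashlib


def dedup_sizes(streams, chunk_size=4096):
    """Return (logical_bytes, stored_bytes) for fixed-size chunk de-duplication.

    Illustrative assumption: a chunk is stored only the first time its
    SHA-256 fingerprint is seen; later copies cost nothing.
    """
    seen = set()
    logical = 0
    stored = 0
    for data in streams:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            logical += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                stored += len(chunk)
    return logical, stored


# Hypothetical scenario: ten nightly full backups of a 64 KiB data set
# (16 distinct 4 KiB blocks) in which only the first block changes each night.
backups = []
for night in range(10):
    blocks = [bytes([i]) * 4096 for i in range(16)]
    blocks[0] = bytes([100 + night]) * 4096  # the one block that changed
    backups.append(b"".join(blocks))

logical, stored = dedup_sizes(backups)
ratio = logical / stored  # 160 chunks written, 25 unique chunks kept
print(f"logical={logical} stored={stored} ratio={ratio:.1f}:1")
```

In this toy case only 25 of the 160 chunks written are unique, so the ratio works out to 6.4:1. The more often largely unchanged data is written again, the higher that ratio climbs.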
However, if we look around for more examples of high data commonality, one area that comes to mind is virtualized server environments. Consider the number of nearly identical virtual C: drives in a VMware server cluster, for example. Recently, NetApp has been leading the way among storage vendors in suggesting de-duplication for primary storage in these environments. In fact, it has been steadily expanding its support of de-duplication, initially offering the technology on its secondary NearStore platforms, then on its primary FAS line, and as of last week on its V-Series NAS gateways, where it can de-duplicate the likes of EMC, HDS, HP and other storage.
Of course, for many, this is uncharted territory, and the performance and management impacts need to be better understood. But given the higher cost of primary storage compared with secondary storage, the potential to achieve 20:1 savings in storage, even for just a portion of the environment, is quite tempting.
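As a rough back-of-the-envelope illustration of what "just a portion of the environment" can mean, the sketch below computes the capacity required when only part of the data de-duplicates. The 100 TB total and the 30% share are made-up figures for the example, not numbers from any vendor or customer; only the 20:1 ratio comes from the discussion above.

```python
def capacity_after_dedup(total_tb, dedup_fraction, ratio):
    """Capacity needed when only a fraction of the data is de-duplicated.

    total_tb       -- capacity in use before de-duplication (hypothetical)
    dedup_fraction -- share of the data that de-duplicates (hypothetical)
    ratio          -- reduction achieved on that share, e.g. 20 for 20:1
    """
    dedup_part = total_tb * dedup_fraction
    untouched_part = total_tb - dedup_part
    return untouched_part + dedup_part / ratio


# Assumed example: 100 TB of primary storage, 30% of it virtual machine
# images that de-duplicate at 20:1; the remaining 70% is left as-is.
needed_tb = capacity_after_dedup(total_tb=100.0, dedup_fraction=0.3, ratio=20.0)
print(f"capacity needed: {needed_tb:.1f} TB")
```

Under these assumptions, 30 TB shrinks to 1.5 TB and the environment as a whole needs 71.5 TB, so the overall reduction is far below 20:1 even though the de-duplicated portion achieves it.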
Jim Damoulakis is chief technology officer of GlassHouse Technologies Inc., a leading provider of independent storage services. He can be reached at jimd@glasshouse.com.