InfoWorld review: Data deduplication appliances
Ever wonder why hard drive capacities continue to get bigger? Do you think IT has ever told management that they will need less storage capacity over the next three years? In fact, three years from now, your company will likely have four times as much data to store as it's storing today. The gigabytes will continue to turn into terabytes, and the terabytes will soon give way to petabytes.
Fortunately, there is a way to slow the inevitable data sprawl: Use data deduplication on your storage system. Data deduplication is the process of analyzing blocks or segments of data on a storage medium and finding duplicate patterns. By removing the duplicate patterns and replacing them with much smaller placeholders, overall storage needs can be greatly reduced. This becomes very important when IT has to plan for backup and disaster recovery needs or when simply determining online storage requirements for the coming year. If admins can reclaim 20, 40, or 60 percent of their capacity by removing duplicate data, current storage investments go that much further.
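To make the idea concrete, here is a minimal sketch of block-level deduplication. It assumes fixed-size 4KB blocks and SHA-256 fingerprints, and it ignores the small overhead of the placeholders themselves. The `BlockStore` class and `BLOCK_SIZE` value are hypothetical names chosen for illustration; none of the appliances reviewed here is known to work exactly this way.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class BlockStore:
    """Toy block-level deduplicating store (hypothetical, not a vendor engine)."""

    def __init__(self):
        self.blocks = {}        # fingerprint -> unique block contents
        self.files = {}         # file name -> ordered fingerprints (the placeholders)
        self.logical_bytes = 0  # bytes written before deduplication

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # keep each unique block once
            refs.append(digest)
        self.files[name] = refs
        self.logical_bytes += len(data)

    def read(self, name):
        # Rebuild the file by following its placeholders back to stored blocks.
        return b"".join(self.blocks[d] for d in self.files[name])

    def physical_bytes(self):
        return sum(len(b) for b in self.blocks.values())

    def reduction_percent(self):
        if self.logical_bytes == 0:
            return 0.0
        return 100.0 * (1 - self.physical_bytes() / self.logical_bytes)


store = BlockStore()
payload = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
store.write("report.doc", payload)
store.write("report-copy.doc", payload)                     # exact duplicate
store.write("report-v2.doc", payload + b"C" * BLOCK_SIZE)   # one new block
print(f"{store.reduction_percent():.1f}% reduction")        # prints "57.1% reduction"
```

Seven blocks are written but only three unique blocks are kept, so the toy store holds 12KB instead of 28KB.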
To see what data deduplication can do, I reviewed four storage appliances that use the technology: the FalconStor FDS 304, the NetApp FAS2040, and the Spectra Logic nTier v80 and nTier vX. All four appliances provided excellent scalability, performance, and data deduplication functionality. Each solution has a bit of its own personality -- one looks like a rack of tape drives, another a large network-attached storage system, and a third a direct-connect Fibre Channel appliance.
FalconStor's FDS 304 is a 2U NAS (network attached storage) appliance utilizing SATA hard drives and gigabit and 10-gigabit Ethernet network interfaces. IT will typically deploy the FDS 304 as a disk-to-disk backup partner or as a target for disk-based backups, but it can also serve as main-line storage. NetApp's FAS2040, also in a 2U form factor, can be deployed as a Gigabit NAS, Fibre Channel, or IP SAN, as well as a Fibre Channel over Ethernet device. It too is going to see action as a target for disk-based backups and data replication; it can also be used as a general-purpose storage medium. For enterprises that have a large investment in physical tape libraries or that will be virtualizing their tape farms, Spectra Logic's nTier family is a great choice. A "drop in" virtual tape library (VTL) appliance that uses FalconStor's data deduplication engine, the nTier can replace physical tape systems or run in parallel with a physical tape library while deduplicating stored data.
All of these appliances offer an easy-to-implement, easy-to-manage, and effective data deduplication system that any enterprise network could take advantage of. Based on my tests with a highly duplicative set of Windows and Office files and their backups, you can expect similar levels of data deduplication from all of them. Note: If you plan to deduplicate system backup sets as well as raw files, you'll want to make sure that the deduplication engine works with your backup software.
There is a preconfigured virtual version of FalconStor's FDS appliance that runs on VMware ESX/ESXi 3.5 update 4 and vSphere 4. Available in both 1TB and 2TB versions, the virtual FDS gives remote and branch offices a way to use data deduplication without requiring an additional piece of hardware.
The core use-case scenario for the FDS 304 is as a target for disk-based storage and backup systems. While FalconStor does offer VTL appliances, the FDS family is intended to be used as a file share for CIFS and NFS clients on the network. It is also meant to take the place of traditional tape-based backup systems. The FDS family comes with Symantec NetBackup OST support to allow tight integration between NetBackup -- or other OST-aware products -- and the appliance. While I did not test using NetBackup, FalconStor claims up to 500MBps maximum inbound speed using OST over 10Gb Ethernet.
I integrated the FDS 304 into my test bed as both a backup destination and CIFS file share. While I could have mounted data shares on the FDS as local storage using iSCSI, I decided to map a drive letter to the various shares from a Windows Server 2008 R2 box and my four virtual Windows 2008 R2 servers. I had no trouble manipulating files on the various shares from any of my servers -- each share behaved just as a typical Windows share would. I also used another share as a backup target for Symantec Backup Exec.
Over the course of my tests, I ran multiple daily backups of the five Windows servers without any issues. Unlike the NetApp FAS2040, the FDS had no problems deduping my Backup Exec backup sets. Typical file and folder deduplication was very efficient, providing nearly a 90 percent reduction on highly duplicated data. The backup sets were "full system backups," including Windows, installed applications, and Microsoft Exchange data stores. A mix of some Microsoft Word and Excel files rounded out the set. I saw virtually no difference in deduplication performance whether the data was a plain collection of files or a set of Backup Exec archives.
There are two choices for when to dedupe the data: on a scheduled basis or in real time (as the data is written to the disk). I set up a nightly scheduled deduplication pass, and it ran without any issues. I also was able to run manual dedupe passes when I wanted to check deduplication results immediately. The real-time deduplication policy, which analyzes the data as it is written to the device, serves to keep the data shares as deduped as possible. There is a small performance penalty when deduping in real time, but in my tests it was negligible. No matter what the deduplication needs, FalconStor will let you define a policy that fits.
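The difference between the two policies can be sketched as follows. This is a toy model under stated assumptions (fixed 4KB blocks, SHA-256 fingerprints, an in-memory staging area); the `Share` class and its method names are hypothetical and do not reflect FalconStor's actual policy engine.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class Share:
    """Toy file share that dedupes either in real time or on a scheduled pass."""

    def __init__(self, inline):
        self.inline = inline
        self.unique = {}   # fingerprint -> unique block
        self.files = {}    # file name -> list of fingerprints
        self.staged = {}   # file name -> raw bytes awaiting the scheduled pass

    def write(self, name, data):
        if self.inline:
            self._dedupe(name, data)   # analyze the data as it is written
        else:
            self.staged[name] = data   # store as-is until the next pass

    def run_scheduled_pass(self):
        for name, data in list(self.staged.items()):
            self._dedupe(name, data)
        self.staged.clear()

    def _dedupe(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.unique.setdefault(digest, block)
            refs.append(digest)
        self.files[name] = refs

    def bytes_on_disk(self):
        deduped = sum(len(b) for b in self.unique.values())
        raw = sum(len(d) for d in self.staged.values())
        return deduped + raw


data = b"x" * BLOCK_SIZE
realtime = Share(inline=True)
nightly = Share(inline=False)
for share in (realtime, nightly):
    share.write("monday.bak", data)
    share.write("tuesday.bak", data)   # identical content

realtime_usage = realtime.bytes_on_disk()   # deduped immediately
nightly_before = nightly.bytes_on_disk()    # still holds both raw copies
nightly.run_scheduled_pass()
nightly_after = nightly.bytes_on_disk()     # reclaimed after the pass
print(realtime_usage, nightly_before, nightly_after)
```

In this model the real-time share never holds the duplicate copy, while the scheduled share needs the extra space until its pass runs. That is the trade-off against the per-write hashing cost of the real-time policy.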
I tried to fool the deduplication engine by renaming files and folders and changing extensions, but as with the other appliances, regardless of what I tried, the deduplication engine always found the duplicate blocks and either added them to the hash table or removed them to reduce overall data size. Because the deduplication engine works at the block level, it looks past such details as file name and type and correctly analyzes the underlying data for duplicates. Regardless of the type of file -- PDF, Word document, ZIP archive, and so on -- the dedupe engine ferreted out the duplicate blocks like a champ.
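Why renaming has no effect can be shown with a short sketch. It assumes the fingerprint is computed from block contents alone, using fixed 4KB blocks and SHA-256; the file names and the `block_fingerprints` helper are made up for illustration, and real engines may chunk data differently.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def block_fingerprints(data):
    """Return the set of SHA-256 digests of the fixed-size blocks in data."""
    return {
        hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
        for i in range(0, len(data), BLOCK_SIZE)
    }


# Eight distinct blocks stand in for a real document.
original = b"".join(bytes([n]) * BLOCK_SIZE for n in range(8))
# Same content with only the final block changed.
edited = original[:-BLOCK_SIZE] + bytes([255]) * BLOCK_SIZE

files = {
    "budget.xls": original,
    "archive/budget.pdf": original,   # renamed, moved, new extension
    "budget-edited.xls": edited,
}

reference = block_fingerprints(files["budget.xls"])
shared = {
    name: len(reference & block_fingerprints(data))
    for name, data in files.items()
}
print(shared)
```

The renamed copy shares all eight blocks with the original, and the edited copy shares seven, because the name and extension never enter the hash. One caveat of this fixed-size sketch is that inserting bytes near the start of a file would shift every later block boundary.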
FalconStor's management interface, which is virtually identical to Spectra Logic's, was easy to navigate once I became familiar with the organization of the UI. While it's not as intuitive as NetApp's System Manager, I had little trouble creating file shares, defining deduplication policy, and monitoring the health and performance of the system. I was able to easily view reports on storage usage, amount of deduped data, and percentage of storage reclaimed by deduplication. These reports will help IT keep tabs on overall storage usage and deduplication performance.
The FalconStor FDS 304 is a solid piece of network engineering and proved to be more than effective in storing data and detecting duplicate blocks of information. It makes an excellent target for disk-based backups and general file sharing. I liked the ease of use when creating CIFS shares, and the ability to serve as an iSCSI target and to export NFS shares offers a good deal of flexibility. While the reporting system isn't anything to get excited about, it does provide enough feedback on the health of the appliance to keep IT well informed.
The dashboard in the FalconStor FDS console provides an at-a-glance overview of disk usage trends.
NetApp FAS2040
Another appliance geared toward disk-based storage and deduplication is NetApp's FAS2040. This appliance allows multiple installation options for the data center, including as a SAN or NAS target, or direct via Fibre Channel. Like the FalconStor appliance, the NetApp can serve as production storage, as a backup device, or as both simultaneously.
The FAS2040 comes with up to two independent storage controllers and scales well, far beyond the FalconStor and Spectra Logic appliances. In addition to CIFS and NFS protocols, the FAS2040 can also automatically export an NFS datastore to a VMware ESX server, a nice time saver for adding online disk space to an existing VMware environment. NetApp's deduplication policy didn't have the same level of flexibility as FalconStor's, but it did a good job of reducing disk usage on volumes with a standard file/folder structure. However, on backup sets created by Symantec Backup Exec 2010, it didn't fare as well.
My NetApp-provided FAS2040 2U chassis was populated with a dozen 300GB SATA drives, two hot-swap storage controllers, each with four Gigabit Ethernet interfaces and two 4Gb Fibre Channel ports, and dual power supplies. My chassis was configured with two aggregates (RAID arrays) -- one for each controller -- in a dual-parity RAID configuration. To fit most any need, there are a variety of hard drives -- Fibre Channel, SAS, or SATA -- available for the FAS2040. By way of additional external drive chassis, the FAS2040 can access a maximum of 136TB of raw space, far more than the other chassis reviewed here.
I installed the FAS2040 on my test network via Gigabit Ethernet, connecting independently to both controllers in the chassis. I carved both aggregates into multiple volumes and shares, defining some as CIFS file shares while setting others up as iSCSI targets. (Like the other systems reviewed, the NetApp also allows you to create NFS shares for Linux/Unix clients.) As with the FalconStor and Spectra Logic appliances, I used the NetApp's various CIFS shares as NAS file storage and as a backup destination for my physical and virtual Windows Server 2008 machines. I had no trouble using both mapped drives and UNC (Universal Naming Convention) connections to the NetApp from all of my servers, physical and virtual. I also had no trouble mounting iSCSI shares as local storage using Microsoft's iSCSI initiator in Windows Server 2008. Each mounted volume behaved exactly like local storage.
One feature I really liked in the FAS2040 was the dual storage controllers. Depending on your needs and the configuration of the appliance, one chassis can serve as its own Active/Passive failover device. In case one controller should suffer a catastrophic failure, the other controller can take over transparently. Or, as in my case, you can use both controllers in an Active/Active configuration, if you want both controllers online and providing independent storage to your network.
Part of my testing involved simple file copies to the shares on the NetApp, while the rest used the NetApp as a destination for multiple Backup Exec jobs. The NetApp's deduplication of files and folders was impressive, showing excellent detection and elimination of duplicate or partially duplicate data. As with the FalconStor and Spectra Logic appliances, data reduction on highly duplicative file shares easily passed 90 percent. However, I was surprised at the trouble the NetApp had with the Backup Exec backup files.