
Extreme storage: Tips from heavy-duty data users

How do you manage a vast data store? The Library of Congress and two other big storage users share some tips.

By John Brandon
October 10, 2011 06:00 AM ET

Computerworld - If you think the storage systems in your data center are out of control, imagine having 450 billion objects in your database or having to add 40 terabytes of data each week.

The challenges of managing massive amounts of data involve storing huge files, creating long-term archives and, of course, making the data accessible. While data management has always been a key function in IT, "the current frenzy has taken market activity to a whole new level," says Richard Winter, an analyst at WinterCorp Consulting Services, which analyzes big data trends.

New products appear regularly from established companies and startups alike. Whether it's Hadoop, MapReduce, NoSQL or one of several dozen data warehousing appliances, file systems and new architectures, the segment is booming, Winter says.

Some IT shops know all too well the challenges inherent in managing big data. At the Library of Congress, Amazon and Mazda, the task requires innovative approaches to handling billions of objects and petascale storage media, tagging data for quick retrieval and rooting out errors.

1. Library of Congress

The Library of Congress processes 2.5 petabytes of data each year, which amounts to around 40TB each week. And Thomas Youkel, group chief of enterprise systems engineering at the library, estimates that the data load will quadruple in the next few years, thanks to the library's dual mandates to serve up data for historians and to preserve information in all its forms.
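The yearly and weekly figures above can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1PB = 1,000TB) and a 52-week year, neither of which is stated in the article; the function name is illustrative. Under those assumptions the weekly rate comes out at roughly 48TB, somewhat higher than but in the same range as the rounded 40TB quoted above.

```python
# Back-of-the-envelope check of the ingest rate quoted in the article.
# Assumptions (not from the article): decimal units and a 52-week year.
TB_PER_PB = 1_000
WEEKS_PER_YEAR = 52


def weekly_terabytes(petabytes_per_year: float) -> float:
    """Convert a yearly volume in petabytes to a weekly volume in terabytes."""
    return petabytes_per_year * TB_PER_PB / WEEKS_PER_YEAR


current = weekly_terabytes(2.5)         # the 2.5PB per year cited above
quadrupled = weekly_terabytes(2.5 * 4)  # the load if it quadruples, as Youkel estimates

print(f"{current:.0f}TB per week today, {quadrupled:.0f}TB per week after quadrupling")
```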

The library stores information on 15,000 to 18,000 spinning disks attached to 600 servers in two data centers. More than 90% of the data, or over 3PB, is stored on a fiber-attached SAN, and the rest is stored on network-attached storage drives.

The Library of Congress has an "interesting model" in that part of the information stored is metadata -- or data about the data that's stored -- while the other part is the actual content, says Greg Schulz, an analyst at consulting firm StorageIO. Plenty of organizations use metadata, but what makes the library unique is the sheer size of its data store and the fact that it tags absolutely everything in its collection, including vintage audio recordings, videos, photos and other media, Schulz explains.

The actual content -- which is seldom accessed -- is ideally kept offline and on tape, Schulz says, with perhaps a thumbnail or low-resolution copy on disk.

Today, the library holds around 500 million objects per database, but Youkel expects that number to grow to as many as 5 billion. To prepare, Youkel's team has started rethinking the library's namespace system. "We're looking at new file systems that can handle that many objects," he says.

Gene Ruth, a storage analyst at Gartner, says that scaling up and out correctly is critical. When a data store grows beyond 10PB, the time and expense of backing up and otherwise handling that much data rise sharply. One approach, he says, is to have infrastructure in a primary location that handles most of the data and another facility for secondary, long-term archival storage.

2. Amazon.com

E-commerce giant Amazon.com is quickly becoming one of the largest holders of data in the world, with around 450 billion objects stored in its cloud for its customers' and its own storage needs. Alyssa Henry, vice president of storage services at Amazon Web Services, says that translates into about 1,500 objects for every person in the U.S. and one object for every star in the Milky Way galaxy.

Some of the objects in the database are fairly massive -- up to 5TB each -- and could be databases in their own right. Henry expects single-object size to get as high as 500TB by 2016. The secret to dealing with massive data, she says, is to split the objects into chunks that can be handled in parallel, a process called parallelization.

In its S3 storage service, Amazon uses its own custom code to split files into 1,000MB pieces. This is a common practice, but what makes Amazon's approach unique is how the file-splitting process occurs in real time. "This always-available storage architecture is a contrast with some storage systems which move data between what are known as 'archived' and 'live' states, creating a potential delay for data retrieval," Henry explains.
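To make the general idea of chunking and parallel handling concrete, here is a minimal sketch. It is not Amazon's code, which the article describes only as custom: the function names are hypothetical, the 1,000MB piece size is read as decimal megabytes, and a thread pool stands in for whatever distributed machinery a real service would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Tuple

# 1,000MB per piece, interpreted as decimal megabytes (an assumption).
CHUNK_SIZE = 1_000 * 1_000_000


def chunk_ranges(object_size: int, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that together cover object_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < object_size:
        length = min(chunk_size, object_size - offset)
        yield offset, length
        offset += length


def process_in_parallel(data: bytes, chunk_size: int,
                        worker: Callable[[bytes], int]) -> List[int]:
    """Apply worker to every chunk of data concurrently, keeping chunk order."""
    ranges = list(chunk_ranges(len(data), chunk_size))
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda r: worker(data[r[0]:r[0] + r[1]]), ranges))


# A 5TB object (decimal) split into 1,000MB pieces.
print(len(list(chunk_ranges(5 * 10**12))), "chunks")

# Tiny in-memory example: the length of each 4-byte chunk of a 10-byte object.
print(process_in_parallel(b"abcdefghij", 4, len))
```

Working with offset and length pairs rather than copies of the data keeps the splitting step cheap; only the worker that handles a given chunk needs to touch its bytes.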

Another problem in handling massive data is corrupt files. Most companies don't worry about the occasional corrupt file. Yet, when dealing with almost 450 billion objects, even low failure rates become challenging to manage.
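A quick calculation shows why scale changes the picture. The failure rates below are purely hypothetical, chosen for illustration; the article gives no actual rates. Only the object count comes from the text.

```python
# Expected number of corrupt objects at a given per-object failure rate.
OBJECT_COUNT = 450_000_000_000  # roughly 450 billion objects, per the article


def expected_corrupt(objects: int, failure_rate: float) -> float:
    """Expected count of corrupt objects, treating failures as independent."""
    return objects * failure_rate


# Hypothetical rates: one in a million and one in a billion.
for rate in (1e-6, 1e-9):
    count = expected_corrupt(OBJECT_COUNT, rate)
    print(f"failure rate {rate:.0e}: about {count:,.0f} corrupt objects")
```

Even at a one-in-a-billion rate, a store of this size would still expect hundreds of damaged objects, which is why detection and repair have to be automated.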

Amazon's custom software analyzes every piece of data for bad memory allocations, calculates the checksums, and analyzes how fast an error can be repaired to deliver the throughput needed for cloud storage.
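The checksum step can be illustrated with a short sketch. The article does not say which checksum algorithm Amazon uses or how its software is structured, so SHA-256 from Python's standard library and the function names here are assumptions made for the example.

```python
import hashlib
from typing import List, Sequence


def checksum(chunk: bytes) -> str:
    """Return a hex digest for a chunk (SHA-256 is an assumed choice)."""
    return hashlib.sha256(chunk).hexdigest()


def find_corrupt_chunks(chunks: Sequence[bytes], expected: Sequence[str]) -> List[int]:
    """Return the indexes of chunks whose digest no longer matches the stored one."""
    return [
        index
        for index, (chunk, digest) in enumerate(zip(chunks, expected))
        if checksum(chunk) != digest
    ]


# Record digests when the chunks are written...
original = [b"chunk-0", b"chunk-1", b"chunk-2"]
stored_digests = [checksum(c) for c in original]

# ...then compare later, after one chunk has been damaged.
later = [b"chunk-0", b"chunk-X", b"chunk-2"]
print(find_corrupt_chunks(later, stored_digests))
```

Because each chunk has its own digest, a mismatch pinpoints the piece that needs repair, so the rest of the object does not have to be re-read or re-copied.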


