How to Implement Next-Generation Storage Infrastructure for Big Data
CIO - Everyone is talking about Big Data analytics and associated business intelligence marvels these days, but before organizations can leverage the data, they'll have to figure out how to store it. Managing data stores at the petabyte scale and beyond is fundamentally different from managing traditional large-scale data sets. Just ask Shutterfly.
Shutterfly is an online photo site that differentiates itself by allowing users to store an unlimited number of images that are kept at the original resolution, never downscaled. It also says it never deletes a photo.
"Our image archive is north of 30 petabytes of data," says Neil Day, Shutterfly senior vice president and chief technology officer. He adds, "Our storage pool grows faster than our customer base. When we acquire a customer, the first thing they do is upload a bunch of photos to us. And then when they fall in love with us, the first thing they do is upload a bunch of additional photos."
To get an idea of the scale we're talking about, one petabyte is equivalent to 1,000 terabytes or 1 million gigabytes. The archive of the first 20 years of observations by NASA's Hubble Space Telescope comes to a bit more than 45 terabytes of data, and one terabyte of compressed audio recorded at 128 kbit/s would contain about 17,000 hours of audio.
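The scale figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and a constant bitrate of 128 kilobits per second; the helper name `audio_hours` is illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def audio_hours(capacity_bytes: int, bitrate_bits_per_s: int) -> float:
    """Hours of audio that fit in capacity_bytes at a constant bitrate."""
    seconds = capacity_bytes * 8 / bitrate_bits_per_s
    return seconds / 3600


terabytes_per_petabyte = PB // TB              # 1,000
gigabytes_per_petabyte = PB // GB              # 1,000,000
hours_per_terabyte = audio_hours(TB, 128_000)  # a little over 17,000 hours

print(terabytes_per_petabyte, gigabytes_per_petabyte, round(hours_per_terabyte))
```

By the same arithmetic, a 30 petabyte archive is several hundred times the size of the 45 terabyte Hubble archive.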
Petabyte-Scale Infrastructures Are Different
"Petabyte-scale infrastructures are just an entirely different ballgame," Day says. "They're very difficult to build and maintain. The administrative load on a petabyte or multi-petabyte infrastructure is just a night and day difference from the traditional large-scale data sets. It's like the difference between dealing with the data on your laptop and the data on a RAID array."
When Day joined Shutterfly in 2009, storage had already become one of the company's biggest buckets of expense, and it was growing at a rapid clip--not just in terms of raw capacity, but in terms of staffing.
"Every n petabytes of additional storage meant we needed another storage administrator to support that physical and logical infrastructure," Day says. With such massive data stores, he says, "things break much more frequently. Anyone who's managing a really large archive is dealing with hardware failures on an ongoing basis. The fundamental problem that everyone is trying to solve is, knowing that a fraction of your drives are going to fail in any given interval, how do you make sure your data remains available and the performance doesn't degrade?"
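To see why failures become routine at this scale, consider a rough back-of-the-envelope model. Everything numeric here is an assumption for illustration, not a Shutterfly figure: a hypothetical fleet of 10,000 drives (roughly what 30 petabytes would need on 3 terabyte drives with no redundancy), a hypothetical 2 percent annual failure rate, and independent failures spread evenly over the year.

```python
def expected_failures(num_drives: int, annual_failure_rate: float, days: float) -> float:
    """Expected number of drive failures in a window of `days` days."""
    return num_drives * annual_failure_rate * days / 365


def prob_at_least_one_failure(num_drives: int, annual_failure_rate: float, days: float) -> float:
    """Chance that at least one drive fails in the window.

    Assumes independent drives and a failure rate spread evenly over the year.
    """
    p_single_drive = annual_failure_rate * days / 365
    return 1 - (1 - p_single_drive) ** num_drives


# Hypothetical inputs, chosen only to illustrate the scale effect.
FLEET_SIZE = 10_000
ASSUMED_AFR = 0.02

weekly_failures = expected_failures(FLEET_SIZE, ASSUMED_AFR, days=7)
weekly_risk = prob_at_least_one_failure(FLEET_SIZE, ASSUMED_AFR, days=7)

print(f"expected failures per week: {weekly_failures:.1f}")
print(f"chance of at least one failure in a week: {weekly_risk:.1%}")
```

Under these assumptions, the fleet loses several drives every week and a failure-free week is rare, which is the situation Day describes.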
Scaling RAID Is Problematic
The standard answer to drive failure is replication, usually in the form of RAID arrays. But at massive scales, RAID can create more problems than it solves, Day says. In a traditional RAID data storage scheme, copies of each piece of data are mirrored and stored on the various disks of the array, ensuring integrity and availability. But that means a single piece of data, once stored and mirrored, can require more than five times its size in storage. As the drives used in RAID arrays get larger--3 terabyte drives are very attractive from a density and power consumption perspective--the time it takes to bring a replacement for a failed drive back to full parity grows longer and longer.
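Both costs can be sketched with simple arithmetic. The rebuild model below is a lower bound that assumes the replacement drive is written sequentially at a constant rate; the 100 MB/s rate is a hypothetical value, and real rebuilds that compete with production traffic typically take longer. The overhead factor of 5 takes the "more than five times" figure above as a floor.

```python
TB = 10**12
PB = 10**15


def rebuild_hours(drive_bytes: float, rebuild_bytes_per_s: float) -> float:
    """Minimum hours to rewrite a full drive at a constant rebuild rate."""
    return drive_bytes / rebuild_bytes_per_s / 3600


def raw_bytes_needed(logical_bytes: float, overhead_factor: float) -> float:
    """Raw capacity consumed when each stored byte is inflated by overhead_factor."""
    return logical_bytes * overhead_factor


# Hypothetical sustained rebuild rate: 100 MB/s.
ASSUMED_REBUILD_RATE = 100 * 10**6

hours_1tb = rebuild_hours(1 * TB, ASSUMED_REBUILD_RATE)
hours_3tb = rebuild_hours(3 * TB, ASSUMED_REBUILD_RATE)
raw_for_30pb = raw_bytes_needed(30 * PB, overhead_factor=5)

print(f"1 TB drive: {hours_1tb:.1f} h, 3 TB drive: {hours_3tb:.1f} h")
print(f"raw capacity for 30 PB at 5x overhead: {raw_for_30pb / PB:.0f} PB")
```

Rebuild time grows linearly with drive capacity in this model, so tripling the drive size triples the window during which the array runs with reduced redundancy.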