Enough data to fill a stack of DVDs to the moon (and back)

As should become obvious throughout the course of this blog, I believe that storage needs to change radically -- in terms of technology, architecture, deployment, and -- perhaps most importantly -- economics.

The reasons for this are quite simple:

a) The amount of data that needs to be stored is growing at unprecedented rates, and there is no reason to expect that to change any time soon.

b) The computing and networking worlds have changed radically, and storage needs to change to support the new computing paradigm.

For this post, I'd like to deal with the first of these items -- the growth in data.

It's no surprise that the amount of data being created is growing. But you may be surprised by how much. To put things in perspective, the complete works of William Shakespeare (in text form) represent about 5 MB of data, so you could store roughly 1,000 copies of Shakespeare on a single DVD. The text of all the books in the Library of Congress would fit comfortably on a stack of DVDs the height of a single-story house.
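As a rough back-of-the-envelope check on those comparisons -- and treating the inputs as assumptions: a single-layer DVD holds about 4.7 GB and is about 1.2 mm thick, and the text of the Library of Congress is often estimated at around 10 TB -- the arithmetic looks like this:

```python
# Back-of-the-envelope check on the DVD comparisons above.
# Assumed figures: a single-layer DVD holds ~4.7 GB and is ~1.2 mm thick;
# the text of the Library of Congress is a commonly cited ~10 TB estimate.

SHAKESPEARE_MB = 5
DVD_GB = 4.7
DVD_THICKNESS_MM = 1.2
LOC_TEXT_TB = 10  # estimate, not an exact figure

copies_per_dvd = DVD_GB * 1024 / SHAKESPEARE_MB
print(f"Copies of Shakespeare per DVD: ~{copies_per_dvd:,.0f}")   # ~960

dvds_for_loc = LOC_TEXT_TB * 1024 / DVD_GB
stack_height_m = dvds_for_loc * DVD_THICKNESS_MM / 1000
print(f"DVDs for the Library of Congress text: ~{dvds_for_loc:,.0f}")
print(f"Stack height: ~{stack_height_m:.1f} m")                   # ~2.6 m
```

Under those assumptions the stack comes out at roughly 2.6 meters -- about the ceiling height of a single-story house.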

Last year, the world created enough digital data to fill a stack of DVDs that would stretch from Earth to the moon and back. The amount of unstructured data created this year is expected to be about 60 percent greater than last year. (And, of course, we're not throwing away all of the data created last year or the years before.) According to some estimates, the "stack of DVDs" will reach Mars by the end of the decade.

This pattern holds across almost every industry. More than 13 million hours of video were uploaded to YouTube during 2010, and 35 hours of video are now uploaded every minute -- driven in part by the fact that a growing share of the world's 4.6 billion mobile phones can record video. In the medical field, imaging resolutions keep climbing, and we are not far from the day when part of our medical records will include genomic data. Particle physics experiments at the Large Hadron Collider at CERN generate on the order of 40 terabytes of data every second.

In short, we're generating lots of data. And that data presents a fundamentally different problem than the one storage was built to solve only a decade ago.

Most of today's storage technology was built around the design center of a "large database": for example, the transactional history of a bank. A large database might be 500 GB and would consist of highly structured rows and columns. Perhaps most importantly, the consequences of losing or corrupting even a byte of a transactional database would be huge, so enterprises were willing to spend almost anything to protect that data. Not surprisingly, the solutions built for this world were geared toward structured, block-based data and were delivered as highly reliable -- and highly expensive -- proprietary, monolithic systems. Given the extreme sensitivity of that 500 GB database, storage manufacturers felt comfortable charging substantial premiums for them.

Today, the problem has shifted to unstructured data: video, images, collider data, seismic files, and the like. A 500 GB database was once considered massive, but a typical consumer can now create that much data (in the form of video, music, or photos) in a matter of days. While this unstructured data is important, and demands good performance and availability, it is simply infeasible to spend as much per GB to support a consumer's video trove as to support a bank's transactional history.

Unfortunately, while the price of networked storage has dropped over the years (by about 15 percent per year), it has in no way kept pace with Moore's Law, with the falling price of the CPU cycles that generate all this data, or with the 60 percent annual growth in the amount of data being created. For an organization experiencing these trends, the storage budget (for new data alone) would need to grow roughly 36 percent per year (1.60 x 0.85 ~ 1.36), more than quadrupling over five years.
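Here is a minimal sketch of that compounding, assuming a simplified model in which the budget only covers newly created data, using the 60 percent growth and 15 percent cost-per-TB decline cited above, with the starting budget normalized to 1.0:

```python
# Back-of-the-envelope model: annual spend on *new* capacity only.
# Assumptions (from the figures cited above): data created grows 60%/year,
# cost per TB falls 15%/year. Baseline spend is normalized to 1.0.

DATA_GROWTH = 0.60       # new data created grows 60% per year
PRICE_DECLINE = 0.15     # cost per TB falls 15% per year

yearly_multiplier = (1 + DATA_GROWTH) * (1 - PRICE_DECLINE)  # ~1.36

spend = 1.0
for year in range(1, 6):
    spend *= yearly_multiplier
    print(f"Year {year}: spend is {spend:.2f}x the baseline")

# Year 5 comes out to roughly 4.7x the starting budget -- the storage bill
# compounds at ~36% per year even though hardware keeps getting cheaper.
```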

Given that fundamental economic disparity, and given the technological changes that are hitting the computing industry (more in the next blog post), it is clear that the conditions are ripe for a revolution in storage. Stay tuned!

Figure 1: Theoretical increase in storage budget required for 60% data growth and 15% cost/TB decline

Ben Golub is President and CEO at Gluster. He is on Twitter @golubbe.
