Everything Must Stay!

The cost of disk storage is so low that the easiest thing for companies to do is to just hold on to all their data. And that opens up radical new possibilities.

A company today can buy a terabyte of enterprise-class disk storage for about $5,000. Eight years ago, it would have cost $200,000. Even the dramatic drop in the cost of processing, as predicted by Moore's Law, doesn't happen that fast.
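The comparison above can be worked out directly. Assuming Moore's Law as a doubling roughly every two years, eight years predicts about a 16x improvement, while the quoted storage prices fell 40x over the same span:

```python
# Back-of-the-envelope arithmetic for the price comparison above.
storage_factor = 200_000 / 5_000   # $200,000/TB then vs. $5,000/TB now
moore_factor = 2 ** (8 / 2)        # doubling every ~2 years for 8 years

print(f"Storage got {storage_factor:.0f}x cheaper; "
      f"Moore's Law predicts ~{moore_factor:.0f}x over the same period")
```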

The cost of storage is plummeting just in time for consumers to save all those digital photos, videos and songs they've developed an appetite for, and just in time for companies to comply with new regulations on document retention. And while they're at it, companies might as well hold on to all of their sales transactions and do some data mining and analysis.

But there's more going on here than a linear extrapolation of capabilities. Size matters, and users with big storage systems are likely to find themselves at a tipping point, empowered with fundamentally new capabilities.

"Since storage is almost free, you can kind of keep everything now," says Kunle Olukotun, a professor of electrical engineering and computer science at Stanford University. And "everything" is just what you need for an increasingly popular class of techniques called statistical machine learning, he says. As the name suggests, the idea is for a system to develop its own rules of logic by discovering patterns and relationships in data rather than having a programmer hard-code the rules in advance.
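A toy sketch of what "the data defines the rules" can mean in practice (the messages, labels and scoring scheme here are invented for illustration): instead of hard-coding a rule such as "flag any message containing a certain word," the program derives per-word weights from labeled examples, so more examples yield better rules.

```python
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs, label 'spam' or 'ham'.
    Returns per-word spam scores learned from the counts."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in examples:
        target = spam_counts if label == "spam" else ham_counts
        target.update(text.lower().split())
    scores = {}
    for word in set(spam_counts) | set(ham_counts):
        # Add-one smoothing so rarely seen words don't dominate.
        scores[word] = (spam_counts[word] + 1) / (ham_counts[word] + 1)
    return scores

def classify(scores, text):
    # Multiply the learned word scores; above 1.0 means spam-leaning.
    score = 1.0
    for word in text.lower().split():
        score *= scores.get(word, 1.0)
    return "spam" if score > 1.0 else "ham"
```

Nothing about "cash" or "meeting" is written into the program; the rules emerge from whatever examples it is given.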

"There's this notion of using large amounts of data to do things that previously were done by clever algorithms, for example, language translation," says Olukotun. Traditionally, automated language translation has been accomplished via bilingual dictionaries and databases of linguistic rules. Google Inc. uses that method for translating among English, Spanish, German and French. But it's using machine learning in experimental translation engines for Arabic, Chinese and Russian.

"The Google guys said, 'Why don't we just throw massive amounts of data at it and look at lots of examples of the source and destination languages to come up with the rules?'" Olukotun says. "The more data you have, the better the rules get."
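A drastically simplified sketch of that idea (real statistical translation systems use alignment models far more sophisticated than this; the sentence pairs below are invented): from paired example sentences alone, pick for each source word the target word it co-occurs with most often, and a crude lexicon falls out of the data.

```python
from collections import defaultdict, Counter

def learn_lexicon(parallel_pairs):
    """parallel_pairs: list of (source_sentence, target_sentence).
    Builds a word-for-word lexicon from co-occurrence counts."""
    cooc = defaultdict(Counter)
    for src, tgt in parallel_pairs:
        for s in src.split():
            cooc[s].update(tgt.split())
    # For each source word, take its most frequent target companion.
    return {s: counts.most_common(1)[0][0] for s, counts in cooc.items()}

def translate(lexicon, sentence):
    # Word-for-word substitution; unknown words pass through unchanged.
    return " ".join(lexicon.get(w, w) for w in sentence.split())
```

More sentence pairs sharpen the counts, which is exactly the "more data, better rules" dynamic Olukotun describes; word order and grammar are what the real systems add on top.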

Access to vast amounts of data will put the solutions to some kinds of problems within reach, predicts Rick Rashid, senior vice president for research at Microsoft Corp. "We can think about analyzing huge amounts of epidemiological data to find solutions to medical problems, or think about traffic flow and urban planning and energy usage," he says. "There's a huge amount of data out there that could help us manage our society."

"Where there are large stores of data, whether the company realizes it or not, there is a gold mine; there is free money," says Kevin Scott, vice president of engineering at AdMob Inc. in San Mateo, Calif. "Machine learning is going to be in wider and wider use as people begin to understand that it can fundamentally change the value proposition of your business."

Scott, who was recently a senior engineering manager at Google, says companies with large volumes of sales transaction data can use it for collaborative filtering, a way to infer information about a shopper by comparing his transactions with those of similar users. It's how Amazon.com Inc. guesses that you might be persuaded to buy a book about digital photography when you order Adobe Photoshop.

"If your data volumes are low, that kind of thing is impossible," Scott says.
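The core of collaborative filtering can be sketched in a few lines (a minimal sketch with invented users and purchases; production systems use far richer similarity measures): find the other shopper whose purchase history overlaps yours most, then recommend what they bought that you haven't.

```python
def recommend(purchases, user):
    """purchases: dict mapping user -> set of purchased items.
    Returns items bought by the most similar other user that
    `user` hasn't bought yet."""
    mine = purchases[user]
    best, best_overlap = None, 0
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(mine & items)   # shared purchases = similarity
        if overlap > best_overlap:
            best, best_overlap = other, overlap
    return sorted(purchases[best] - mine) if best else []
```

Scott's point about volume shows up directly here: with only a handful of shoppers the overlaps are mostly zero and the recommendations are noise; with millions of transaction histories the similar-user signal becomes strong.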

Building Structure

The data growing most quickly in volume today is unstructured information, which can't be readily parsed and analyzed by automated means. More than 80% of all digital information in companies is in unstructured documents such as e-mail messages, images and the like, according to research firm IDC. That provides a huge opportunity for data mining via techniques akin to machine learning, Scott says.

For example, he says, a large company could pass many thousands of free-form performance reviews through algorithms that learn which data represents an employee name, an annual raise and so on. The systems would extract structure from unstructured information. Then a query tool could answer a question such as, "What is my average employee rating by geographic region each year for the past 10 years?" It would not be necessary to laboriously put the reviews into an indexed database.
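A hypothetical sketch of the extraction step (a real system would learn its extractors from labeled examples rather than use hand-written patterns, and the field names and review format here are invented): pull a structured record out of free-form review text.

```python
import re

def extract_review(text):
    """Turn one free-form review into a structured record.
    The patterns below are illustrative, not from any real system."""
    record = {}
    name = re.search(r"Employee:\s*([A-Za-z ]+?)(?:\.|,|$)", text)
    raise_pct = re.search(r"raise of (\d+(?:\.\d+)?)%", text)
    rating = re.search(r"rating[:\s]+(\d(?:\.\d)?)\b", text, re.IGNORECASE)
    if name:
        record["name"] = name.group(1).strip()
    if raise_pct:
        record["raise_pct"] = float(raise_pct.group(1))
    if rating:
        record["rating"] = float(rating.group(1))
    return record
```

Once every review yields a record like this, the average-rating-by-region query becomes an ordinary aggregation over the extracted fields, with no hand-built database required up front.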

Techniques such as machine learning and text mining won't be used widely until sometime well into the future, but many organizations already need to store all of their data for other purposes. "The emergent property we are seeing is that companies are saving everything," says Richard Villars, an analyst at IDC. "We have this explosion in rich content, and it's not just consumers with digital phones and videos and music. It's hospitals moving to electronic records and X-rays and MRIs, and banks going to video surveillance, and then archiving that for years at a time."

In fact, some companies save individual pieces of data multiple times. Intellidyn Corp. in Hingham, Mass., has 70TB of data, covering things such as demographics, lifestyles, credit histories and property transactions, on 200 million adults in the U.S. CEO Peter Harvey says multiple credit agencies send files that have 90% overlap, but it is cheaper and easier to store the extra data than to purge it. And he says Intellidyn itself duplicates data by setting up private data marts for clients so they won't have to compete for the same piece of information in simultaneous queries.

[Graph: "The Bottom Fell Out"]

Harvey says Intellidyn's storage will grow to 2 petabytes within two years as the company begins capturing and saving more transaction details. "So what?" he shrugs. "Storage is getting cheaper every day."

But while raw disk space may be, as Olukotun says, "almost free," one should not confuse a doubling in disk capacity with a doubling in performance from a disk system, Olukotun adds. "We've got this huge mismatch between the transfer bandwidth and latency of the disk and how much you can store on the disk," he says.
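The scale of that mismatch is easy to illustrate with back-of-the-envelope figures (the 75 MB/s sustained transfer rate below is an assumed, roughly era-appropriate value, not from the article): reading a full 1TB disk end to end takes hours, no matter how cheap the terabyte was.

```python
# How long does one sequential pass over a full disk take?
capacity_bytes = 1 * 10**12    # 1 TB of stored data
bandwidth = 75 * 10**6         # ~75 MB/s sustained transfer (assumed)

seconds = capacity_bytes / bandwidth
hours = seconds / 3600
print(f"Full scan of the disk: {hours:.1f} hours")
```

Capacity has grown far faster than transfer rates, so every added terabyte widens the gap between what you can store and how quickly you can get at all of it.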

A traditional way to get data on and off a disk faster has been to increase the rotational speed of the disk, but mechanical limits cap that speed at about 15,000 rpm, says Villars. That could open the door to newer technologies, he says, such as persistent flash memory, which has no moving parts.

In the meantime, says AdMob's Scott, users will have to compensate for the transfer bandwidth bottleneck by being clever designers.

"We are entering an era where programmers will have to be fairly savvy about the performance of the programs they write," Scott says. "They won't be able to make silly decisions like, 'Oh, I have a 1TB disk, so I will write a program to do a linear scan through 500GB of data to find an e-mail.' That is not going to work."
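Scott's contrast can be sketched in miniature (toy in-memory data with invented messages; a real mail store would keep its index on disk): the "silly" approach touches every message on every query, while an inverted index built once answers each query by a single lookup.

```python
from collections import defaultdict

def build_index(messages):
    """messages: dict of message_id -> text.
    Builds an inverted index: word -> set of message ids."""
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for word in text.lower().split():
            index[word].add(msg_id)
    return index

def search_indexed(index, word):
    # One dictionary lookup, regardless of how much mail is stored.
    return index.get(word.lower(), set())

def search_scan(messages, word):
    # The linear scan Scott warns against: reads every message each time.
    return {mid for mid, text in messages.items()
            if word.lower() in text.lower().split()}
```

Both return the same answer; the difference is that the scan's cost grows with the size of the store, which is exactly what becomes untenable at hundreds of gigabytes.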


Copyright © 2007 IDG Communications, Inc.

  