We've all heard the predictions: By 2020, the quantity of electronically stored data will reach 35 trillion gigabytes (35 zettabytes), a 44-fold increase from 2009. The total had already reached 1.2 million petabytes, or 1.2 zettabytes, by the end of 2010, according to IDC. That's enough data to fill a stack of DVDs reaching from the Earth to the moon and back -- about 240,000 miles each way.
Enter "big data," a nascent group of data mining technologies that are making the storage, manipulation and analysis of reams of data cheaper and faster than ever. Once relegated to the supercomputing environment, big data technology is becoming available to the enterprise masses -- and it is changing the way many industries do business.
The open-source connection
"A lot of people consider Hadoop and big data to be synonyms. That's a mistake," Olofson says. Some implementations of Teradata, MySQL and "clever clustering technologies" that don't use Hadoop can also be considered big data, he explains.
Hadoop, an application environment for big data, has drawn the most attention because it's based on MapReduce, an approach common in supercomputing circles but simplified and made elegant by a project largely funded by Google. Hadoop is the best known of a group of closely related Apache projects, which also includes the HBase database that runs on top of Hadoop's distributed file system.
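For readers who have not seen MapReduce, the idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern on a single machine, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample documents are made up for the example. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence (hypothetical single-machine driver)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Made-up sample input, purely for illustration.
documents = ["big data is big", "data about data"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both steps can be spread across a cluster, which is the property that makes the approach attractive for very large data sets.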
In other industries, businesses are realizing that much more of their value proposition is information-based than they had previously thought, so they'll likely become big users of big data technologies before long, Olofson says. Couple that with affordable hardware and software, and enterprises find themselves in a perfect storm of business transformation opportunities.
New York-based TRA helps organizations measure the value of TV advertising by matching the advertisements received in a given home via TVs and DVRs with buying behavior at the retail checkout counter. The company gathers data from cable provider DVRs and grocery store loyalty card programs to make these correlations. TRA's system processes reams of data representing the second-by-second viewing habits of 1.7 million households -- a feat that would have been impossible without big data technology. TRA deployed Kognitio's WX2 database, which allows the company to load, profile and analyze data quickly, collect granular ad-viewing information from DVRs, integrate it with detailed point-of-sale data, and produce customized reports.
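The kind of matching described here, pairing household ad exposure with checkout purchases, can be sketched as a simple join on a household identifier. Everything in this example is hypothetical: the record layout, the field names, the five-second exposure threshold and the sample rows are assumptions for illustration and do not describe TRA's actual data model or Kognitio's software.

```python
# Hypothetical ad-viewing records, as might be derived from DVR data.
ad_views = [
    {"household": "H1", "ad": "cereal_spot", "seconds_watched": 30},
    {"household": "H2", "ad": "cereal_spot", "seconds_watched": 4},
    {"household": "H3", "ad": "soda_spot", "seconds_watched": 15},
]

# Hypothetical purchase records, as might come from a loyalty card program.
purchases = [
    {"household": "H1", "product": "cereal"},
    {"household": "H3", "product": "cereal"},
]


def conversion_by_exposure(ad_views, purchases, ad, product, min_seconds=5):
    """Compare purchase rates for households that did and did not see an ad.

    min_seconds is an assumed cutoff for counting a viewing as an exposure.
    """
    buyers = {p["household"] for p in purchases if p["product"] == product}
    exposed = {
        v["household"]
        for v in ad_views
        if v["ad"] == ad and v["seconds_watched"] >= min_seconds
    }
    everyone = {v["household"] for v in ad_views} | {
        p["household"] for p in purchases
    }
    unexposed = everyone - exposed

    def rate(group):
        # Share of the group that bought the product; 0.0 for an empty group.
        return len(group & buyers) / len(group) if group else 0.0

    return {"exposed": rate(exposed), "unexposed": rate(unexposed)}


result = conversion_by_exposure(ad_views, purchases, "cereal_spot", "cereal")
print(result)
```

At the scale the article describes, the same join would run inside the database rather than in application code, which is where an in-memory engine earns its keep.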
"Kognitio has an in-memory solution, so a full half of our current entire database can be in memory, which means the response time when a customer of ours runs a query literally can be seconds as opposed to hours and days," says TRA's CEO, Mark Lieberman.
By analyzing the data, Catalina helps major consumer goods manufacturers and large supermarket chains predict what customers are likely to buy and who will be interested in new products.
"We wanted to bring the technology to the data and not the data to the technology," says Eric Williams, executive vice president and CIO at Catalina. "The technology exists now that allows companies like SAS to move their [analytics] technology into the database. That has exponentially changed the entire corporation. We were doing these things before but had serious limitations that would not allow us to get where we wanted to go. We had to use homegrown tools, and they were very rudimentary in what they could accomplish. Bringing big data technology to the forefront has changed our entire organization."
In addition to some open-source software in its proprietary systems, Catalina uses SAS Analytics on a Netezza data warehousing appliance platform.
Down the road, he foresees utilities using big data to improve customer service and reduce operational costs through electrical grid monitoring, problem detection and the ability to make micro-adjustments to the grid -- though doing so may require significant upgrades to the aging infrastructure.
Brand marketers are experimenting with Hadoop for "sentiment analysis" in social media. Emerging service providers use Hadoop to sift through Twitter on behalf of clients and discover what tweeters are saying and thinking about specific products.