Big data to drive a surveillance society

Analysis of huge quantities of data will enable companies to learn our habits and activities


MapReduce systems differ from traditional databases in that they can quickly presort data in a batch process, regardless of the data's type, file or block. They can also interface with any number of languages, including C++, C#, Java, Perl, Python and Ruby. Once the data is sorted, a more specific analytical application is needed to run particular queries. Traditional databases can be considerably slower, requiring table-by-table analysis, and they don't scale nearly as well.
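The map/shuffle/reduce pattern the article describes can be sketched in a few lines of plain Python. This is an illustrative word-count sketch of the programming model, not Hadoop's actual API; the function names and sample records are invented for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs from raw input, whatever its source format
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_sort(pairs):
    # Shuffle/sort: group values by key -- the batch "presort" step
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: aggregate each key's grouped values
    return {key: sum(values) for key, values in groups}

records = ["big data drives big clusters", "big clusters drive big data"]
counts = reduce_phase(shuffle_sort(map_phase(records)))
# e.g. counts["big"] == 4
```

Because each phase works on independent key groups, the same three steps can be spread across thousands of machines, which is what lets these systems scale where table-by-table analysis does not.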

For example, Alfred Spector, vice president of research and special initiatives at Google, said it's not inconceivable that a cluster of servers could someday include 16 million processors forming a single massively parallel processing (MPP) data warehouse.

"It doesn't seem there are any limitations other than good engineering required to get there," he said. "Moore's law or not, we have essentially unlimited computation available."

Spector sees a day when distributed computing systems will offer Web developers what he calls "totally transparent processing," where the big data analytics engines learn over time to, say, translate file or block data into whatever language is preferred based on a user's profile history, or act as a moderator on a website, identifying spam and culling it out. "We want to make these capabilities available to users through a prediction API. You can provide data sets and train machine algorithms on those data sets," he said.

Yahoo has been the largest contributor to Hadoop to date, writing about 70% of the code. The Web search company uses Hadoop across all of its business lines and has standardized on the Apache Hadoop distribution because of its open-source licensing.

Papaioannou said Yahoo has 43,000 servers, many of which are configured in Hadoop clusters. By the end of the year, he expects his server farms to grow to 60,000 machines, because the site generates 50TB of data per day and has already stored more than 200 petabytes.

"Our approach is not to throw any data away," he said.

That is exactly what other corporations want to be able to do: Use every piece of data to their advantage so nothing goes to waste.

Lucas Mearian covers storage, disaster recovery and business continuity, financial services infrastructure and health care IT for Computerworld. Follow Lucas on Twitter at @lucasmearian or subscribe to Lucas's RSS feed.

Copyright © 2011 IDG Communications, Inc.
