Big data to drive a surveillance society

Analysis of huge quantities of data will enable companies to learn our habits, activities

Page 2 of 3

Todd Papaioannou, vice president of cloud architecture at Yahoo, said instead of thinking about big data analytics as a weapon that empowers corporate Big Brothers, consumers should regard it as a tool that enables a more personalized Web experience.

"If someone can deliver a more compelling, relevant experience for me as a consumer, then I don't mind it so much," he said.

Yahoo on Wednesday launched an upgraded search engine called Search Direct. Similar to Google's Instant offering, Search Direct delivers richer content to users based on search history. For example, a user searching for a phrase that begins with "New York" would start seeing results as soon as he began to type; once he had entered the words "New York" into the search window, the most popular searches that include those two words would instantly rise to the top of the list, before he finished typing the full phrase.
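The ranking behavior described above can be illustrated with a minimal Python sketch. It is not Yahoo's or Google's implementation: the `suggest` function, the query strings and their counts are all invented for illustration, and simple prefix matching stands in for whatever matching the real services use.

```python
from collections import Counter


def suggest(prefix, query_counts, limit=3):
    """Return up to `limit` past queries starting with `prefix`, most popular first."""
    prefix = prefix.lower()
    matches = [
        (query, count)
        for query, count in query_counts.items()
        if query.lower().startswith(prefix)
    ]
    # Highest count first; ties broken alphabetically so output is deterministic.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:limit]]


# Hypothetical search history; the queries and counts are assumptions.
QUERY_COUNTS = Counter({
    "new york times": 900,
    "new york weather": 750,
    "new york yankees": 600,
    "new jersey transit": 300,
    "newegg": 200,
})

if __name__ == "__main__":
    # Suggestions narrow as the user types more of the phrase.
    for typed in ("new", "new y", "new york"):
        print(typed, "->", suggest(typed, QUERY_COUNTS))
```

Each keystroke simply reruns the lookup with a longer prefix, which is why results can appear before the full phrase has been typed.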

Marc Parrish, vice president of retention and loyalty marketing for bookseller Barnes & Noble, said the amount of machine-generated data has "exploded" since electronic book sales have taken off.

"Our Web logs on how customers are using e-readers and e-books ... have produced 35TB of data and will load us up with another 20TB this year," said Parrish, who noted that e-book sales outstripped hardback book sales on Amazon.com last year.

With that kind of data, the retailer can determine consumer behavior, such as what percentage of shoppers make book-buying decisions based on their fondness for a particular author.

"We have to decide with analytics on hand how we capture the customer's imagination and how we move forward," he said.

Other companies are using big data analytics to track the use of content on their websites in order to better tailor that content to users' tastes.

Sondra Russell, a metrics analyst at National Public Radio, said she needed a way to track website audience usage trends in near real time. NPR offers podcasts, live streams, on-demand streams and other audio content on its website. Her organization had been using the Web analytics engine Omniture, but she said it felt like trying to jam log-based data into a client-side tracking system that couldn't handle the volume.

Russell said NPR experienced query delays that, at best, lasted six to 12 hours and, at worst, lasted for weeks. The organization finally switched to a reporting tool from Splunk that crawls logs, metrics and other application, server and network data and indexes it in a searchable repository.

"I just want to know how many times someone listened to a program during a certain period of time," she said. "With Splunk, I had no delays between data appearing in a query folder and data appearing in reports. I can get any number of graphs without weeks of prep time."
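The kind of question Russell describes, how many times a program was played in a given period, can be sketched in a few lines of Python. This is not Splunk's query language or NPR's log format: the `count_listens` function, the log layout (timestamp, `program=`, `event=`) and the program names are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime


def count_listens(log_lines, start, end):
    """Count 'play' events per program with start <= timestamp < end."""
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        try:
            timestamp = datetime.fromisoformat(parts[0])
            fields = dict(part.split("=", 1) for part in parts[1:])
        except ValueError:
            continue  # skip unparseable timestamps or fields
        program = fields.get("program")
        if program and fields.get("event") == "play" and start <= timestamp < end:
            counts[program] += 1
    return counts


# Hypothetical log lines; the format and program names are assumptions.
SAMPLE_LOG = [
    "2011-03-21T08:00:05 program=program-a event=play",
    "2011-03-21T08:10:00 program=program-b event=play",
    "2011-03-21T09:30:00 program=program-a event=play",
    "2011-03-21T09:45:00 program=program-a event=pause",
    "2011-03-22T07:00:00 program=program-a event=play",
    "malformed line",
]

if __name__ == "__main__":
    day_start = datetime(2011, 3, 21)
    day_end = datetime(2011, 3, 22)
    print(count_listens(SAMPLE_LOG, day_start, day_end))
```

A log-indexing tool does this kind of filtering and grouping over far larger volumes, but the underlying question has the same shape: filter by event type and time window, then group by program.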

IBM's Jonas compared big data to puzzle pieces, which don't look like anything on their own but create a detailed picture once you put them together. That's where Hadoop, Cassandra and other big data tools come in. Hadoop is an open-source framework that pairs a distributed file system with an implementation of MapReduce, a programming model popularized by Google, and allows large-scale computations (batch processing) to be performed in parallel across large server clusters. The computations can be performed on user- or machine-generated data, whether structured or unstructured. Hadoop is particularly well suited to large unstructured data sets, allowing analytics engines to gather information from queries more quickly.
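The MapReduce model behind this kind of batch processing can be shown with a small single-process Python sketch. It does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample records are illustrative assumptions. On a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single total."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical unstructured text records, invented for illustration.
    records = ["big data", "big clusters", "data about data"]
    print(map_reduce(records))
```

Because each record is mapped independently and each key is reduced independently, the work can be split across servers, which is what makes the model suitable for very large data sets.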
