How to manage big data overload

Complex requirements and relentless demands for capacity vex storage administrators. Here's how to handle the data deluge.

Page 3 of 3

Shutterfly eventually adopted erasure code technology, where a piece of data can be broken into chunks, each useless on its own, and dispersed to different disk drives or servers. At any time, the data can be fully reassembled with a fraction of the chunks, even if multiple chunks have been lost due to drive failures. In other words, you don't need to create multiple copies of data; a single instance can ensure data integrity and availability. Because erasure codes are software-based, the technology can be used with commodity hardware, bringing down the cost of scaling even more.
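The mechanics are easiest to see with the simplest possible erasure code: a single XOR parity chunk, as in RAID 5. The sketch below is a hypothetical illustration in plain Python, not Cleversafe's actual coding scheme; it splits data into k chunks plus one parity chunk, so any one lost chunk can be rebuilt from the survivors. Production erasure codes (typically Reed-Solomon variants) generalize this idea to tolerate multiple simultaneous losses.

```python
import functools

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings, byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list:
    """Split data into k equal-length chunks and append one XOR parity chunk."""
    size = -(-len(data) // k)               # ceiling division
    padded = data.ljust(size * k, b"\0")    # pad so all chunks match in length
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = functools.reduce(xor_bytes, chunks)
    return chunks + [parity]                # k data chunks + 1 parity chunk

def rebuild(chunks: list, lost: int) -> bytes:
    """Recover the chunk at index `lost`: the XOR of all survivors equals it."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return functools.reduce(xor_bytes, survivors)

data = b"hello big data world"        # 20 bytes, k=4 -> four 5-byte chunks
chunks = encode(data, 4)
original = chunks[1]
chunks[1] = None                      # simulate a failed drive
chunks[1] = rebuild(chunks, 1)        # rebuilt from the other four chunks
assert chunks[1] == original
assert b"".join(chunks[:4]) == data   # reassemble the original data
```

Because the parity math runs entirely in software, nothing here depends on special hardware, which is the property that lets erasure-coded systems scale on commodity drives.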

One of the early vendors of erasure-code-based software is Cleversafe, which has added location information to create what it calls dispersal coding, allowing users to store chunks -- or slices, as it calls them -- in geographically separate places, like multiple data centers.
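As a rough illustration of the dispersal idea (a hypothetical sketch, not Cleversafe's API), slices can be assigned round-robin across sites so that no single data center holds enough of them for a site outage to make the data unrecoverable:

```python
def disperse(slices, sites):
    """Assign slices to sites round-robin, so the loss of one site
    still leaves the slices stored at the remaining sites."""
    placement = {site: [] for site in sites}
    for i, s in enumerate(slices):
        placement[sites[i % len(sites)]].append((i, s))
    return placement

sites = ["dc-east", "dc-west", "dc-central"]   # hypothetical site names
placement = disperse([f"slice-{i}" for i in range(12)], sites)
assert all(len(held) == 4 for held in placement.values())
```

With 12 slices spread over three sites, losing any one site leaves 8 slices online, which suffices whenever the erasure code can reassemble the data from 8 or fewer slices.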

Mega-Big-Data Users

Like Shutterfly, enterprises with massive storage needs must look beyond block storage, Nadkarni says. "When you're talking about massive data sets in the petabyte range, you have to look at either object-based storage or a distributed file system," he says. "Think about [commercial offerings like] EMC's Isilon scale-out storage or the Dell Fluid File System... and open-source solutions, as well. They're much cheaper to store data, and from a performance perspective, they can offer you a much better price/performance ratio. And ultimately, they're scalable."

Users of commercial software often have data that is partially disposable or has few post-processing requirements, he adds.

Fewer Administrators Required

When deployed correctly, storage virtualization, deduplication, storage tiering and erasure code technologies should reduce your need for administrators, because the tools enable you to manage data through a single pane of glass. In Shutterfly's case, the automated storage infrastructure allowed the company to slow the growth of its maintenance staff. As the company's daily maintenance workload declines, administrators can spend more time on proactive projects.

In some cases, big data projects are done by special teams, not traditional IT staff, Nadkarni says. "They're owned and operated by business units because the IT infrastructure is not agile enough to support big data environments or may not have the skill set for it."

"You might have a situation where storage administrators aren't involved," he adds. "Or they might just have a small role where [they provision] some storage and everything else is done by the systems folks."

Coming Soon

One trend Nadkarni sees catching on is the concept of moving the compute layer to the data. "You look at solutions from Cleversafe and solutions from other storage providers who are building compute capabilities in the storage layer," he says. "It is no longer feasible to move data to where the compute layer sits. It's practically impossible, especially if you only have a few minutes to analyze the data before it gets stale. So why don't I let the compute layer sit where the data sits?"

Cleversafe offers a high-end, Hadoop-based solution for big data heavy-hitters like Shutterfly, "but they're trying to make it more all-purpose," Nadkarni says. "Cleversafe breaks the model of procuring [compute power] from one vendor and app storage from another vendor." To be successful with mainstream enterprises, "business units will have to start thinking in different ways. I'm confident that it will eventually catch on, because the efficiencies in the current model just don't lend themselves to be favorable for big data."

He adds, "Big data is a way for people to maintain their competitive edge. In order to make the most out of their data, they're going to have to change processes and the way they function as a company -- they're going to have to be very quick to derive the value from this data."

But before diving into a new big data storage infrastructure, "people have to do their homework," Csaplar says. "Research it and talk to people who've done it before. It's not cutting-edge, so talk to someone who has already done it so you don't make the same mistakes they've made."

Collett is a Computerworld contributing writer.

Copyright © 2013 IDG Communications, Inc.
