How to manage big data overload

Complex requirements and relentless demands for capacity vex storage administrators. Here's how to handle the data deluge.

It used to be only for scientists, Internet giants and the mega-social-media set -- Amazon, Twitter, Facebook, Shutterfly. But now, more and more enterprises of all kinds are aiming to gain a competitive edge by tapping into big data in hopes of unearthing the valuable information it can hold. Today, companies such as Walmart, Campbell Soup, Pfizer, Merck and convenience store chain Wawa have big plans for their big data.

Some are venturing into big data analytics to respond to customers faster, keep better track of customer information or get new products to market quicker.

"Any business in this Internet Age, if they don't do it, their competition is going to do it," says Ashish Nadkarni, a storage analyst at IDC.

Organizations of all sizes are being inundated by data, from both internal and external sources. Much of that data is streaming in real time -- and much of it is rendered obsolete in minutes, hours or a few days.

The resulting growth of storage needs is especially troubling for large enterprises, where the amount of structured and unstructured data requiring storage grew an average of 44% from 2010 to 2011, according to Aberdeen Group. At companies of all sizes, data storage requirements are doubling every 2.5 years. What's more, different tools are required to optimize the storage of video, spreadsheets, formatted databases and fully unstructured data.
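To put those figures in perspective, here's a rough back-of-the-envelope sketch of what doubling every 2.5 years implies year over year; the 100TB starting capacity and five-year horizon are illustrative assumptions, not figures from any of the firms cited.

```python
# Back-of-the-envelope projection for storage that doubles every 2.5 years.
# The 100TB starting capacity and five-year horizon are illustrative assumptions.

doubling_period_years = 2.5
annual_growth = 2 ** (1 / doubling_period_years) - 1   # roughly 32% per year
print(f"Implied annual growth: {annual_growth:.0%}")

capacity_tb = 100.0   # hypothetical starting capacity
for year in range(1, 6):
    capacity_tb *= 1 + annual_growth
    print(f"Year {year}: about {capacity_tb:.0f}TB")
```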

"The challenge is to try to keep your spending on storage from being linear with your rising storage requirements," says Dick Csaplar, a virtualization and storage analyst at Aberdeen. Technologies that can help mainstream users of big data avoid that fate include storage virtualization, deduplication and storage tiering. For heavy-hitters, such as scientists, social media websites and simulation developers, object-oriented and relational database storage are the best options.

But the nuts and bolts of systems designed to hold petabytes (and more) of data in an easily accessible format are more complex than the inner workings of everyday storage platforms. Here's some expert advice on managing and storing big data.

What Kind of Data Are You Analyzing?

The type of storage required depends on the type and amount of data you're analyzing. All data has a shelf life. A stock quote, for example, is only relevant for a minute or two before the price changes. A baseball score is sought after for about 24 hours, or until the next game. This type of data needs to reside in primary storage when it is most in demand and can then be moved to cheaper storage. Data used to analyze trends over multiple years, on the other hand, usually doesn't need to sit on expensive, easily accessible primary drives.
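One simple way to act on that shelf life is an age-based tiering policy that demotes data from primary storage as it cools. The sketch below is purely illustrative; the tier names and age thresholds are assumptions, not any vendor's product or API.

```python
from datetime import datetime, timedelta

# Hypothetical age-based tiering policy: in-demand data stays on primary storage,
# older data is demoted to cheaper tiers. Tier names and thresholds are examples only.
TIER_RULES = [
    (timedelta(days=1), "primary"),     # hot data: fast, expensive storage
    (timedelta(days=30), "nearline"),   # recent but cooling data
]
DEFAULT_TIER = "archive"                # long-term, rarely read data

def choose_tier(last_accessed: datetime) -> str:
    """Pick a storage tier for an object based on how recently it was used."""
    age = datetime.now() - last_accessed
    for threshold, tier in TIER_RULES:
        if age <= threshold:
            return tier
    return DEFAULT_TIER

# A stock quote touched minutes ago stays on primary storage;
# a year-old report used for trend analysis lands in the archive tier.
print(choose_tier(datetime.now() - timedelta(minutes=5)))   # primary
print(choose_tier(datetime.now() - timedelta(days=365)))    # archive
```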

How Much Storage Do You Really Need?

The amount and type of storage you need for big data depends on both the amount of data you need to store and how long that data will remain useful.

There are three types of data involved in big data analytics, Nadkarni says. "It could be streaming data from multiple sources being sent to you literally every second, and your time slice is a few minutes before that data becomes old," he says. This kind of data includes updates on weather, traffic, trending topics on social networks and tweets about events around the world.

Big data can also include data at rest or data generated and controlled by the business for moderate use.

Streaming data requires only fast capture and analytics capabilities, Nadkarni says. "Once you've analyzed it, you don't need it anymore." But for data at rest or business-controlled data, "it is incumbent upon you to store it," he says.
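That distinction translates directly into capacity planning: streaming data can be sized by its analysis window, while data at rest must be sized by its retention period. The sketch below is a rough illustration; the ingest rates and retention periods are assumed for the example.

```python
# Rough capacity sizing driven by how long data stays useful.
# All ingest rates and retention periods below are hypothetical examples.

def stream_buffer_gb(ingest_gb_per_hour: float, window_hours: float) -> float:
    """Streaming data: only the current analysis window needs to be held."""
    return ingest_gb_per_hour * window_hours

def data_at_rest_gb(ingest_gb_per_day: float, retention_days: int) -> float:
    """Data at rest: everything generated during the retention period is stored."""
    return ingest_gb_per_day * retention_days

# A feed analyzed over a ten-minute slice, then discarded.
print(f"Stream buffer: {stream_buffer_gb(60, 10 / 60):.0f} GB")

# Business-controlled data kept for three years.
print(f"Data at rest: {data_at_rest_gb(50, 3 * 365):,.0f} GB")
```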

What Type of Storage Tools Work Best?

For enterprises just starting to grapple with big data storage and analysis, industry watchers advocate storage virtualization to bring all storage under one umbrella, deduplication to eliminate redundant copies of data, and a tiered storage approach to ensure that the most valuable data is kept on the most easily accessible systems.

Storage virtualization provides an abstraction layer of software that hides physical devices from the user and allows all devices to be managed as a single pool. While server virtualization is a well-established component of today's IT infrastructures, storage virtualization has yet to catch on.
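To make that abstraction concrete, here is a toy sketch of the idea; the device names, capacities and placement policy are invented for illustration and don't reflect any particular product.

```python
# Toy illustration of storage virtualization: several physical devices are
# presented as one logical pool, and placement is hidden from the caller.
# Device names, capacities and the placement policy are made up for the example.

class PhysicalDevice:
    def __init__(self, name: str, capacity_gb: int):
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0

class VirtualPool:
    def __init__(self, devices: list[PhysicalDevice]):
        self.devices = devices
        self.placement: dict[str, PhysicalDevice] = {}

    def total_free_gb(self) -> int:
        return sum(d.capacity_gb - d.used_gb for d in self.devices)

    def write(self, object_id: str, size_gb: int) -> None:
        """Place the object on the least-used device; the caller never sees which."""
        device = min(self.devices, key=lambda d: d.used_gb / d.capacity_gb)
        if device.capacity_gb - device.used_gb < size_gb:
            raise IOError("pool is full")
        device.used_gb += size_gb
        self.placement[object_id] = device

pool = VirtualPool([PhysicalDevice("array-a", 500), PhysicalDevice("array-b", 1000)])
pool.write("backup-2012-06", 200)
print(pool.total_free_gb())  # one pooled figure, regardless of which array was used
```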
