Big data storage doesn't have to break the bank

The era of big data requires new storage strategies. And with faster and smarter technology, these approaches don't have to break the bank.

Page 2 of 4

How do you build a data storage strategy in the era of big data, scale your storage architecture to keep pace with data and business growth, and keep storage costs under control? Find out from big data veterans who share their storage sagas and explain how they have reinvented their storage strategies.

Lower-End Storage Does the Trick

In close political races, data can make a difference. Just ask the folks at Catalist. A Washington-based political consultancy, Catalist stores and mines data on 190 million registered voters and 90 million unregistered voters -- including almost a billion "observations" of people based on public records such as real estate transactions or requests for credit reports. The information produced from its analytics tools tells campaign organizers whose door to knock on and can even prompt candidates to change their voter strategies overnight.

"We used to have a big EMC storage system that we retired a while back just because it was so expensive and consumed so much power," says Catalist CTO Jeff Crigler, noting that the EMC system also ran out of space. So the firm built a cluster of NAS servers that each hold about a petabyte of data. "It's essentially a big box of disks with a processor that's smart enough to make it act like an EMC-like solution" with high-density disk drives, some "fancy" configuration software and very modest CPU to run the configuration software.

Csaplar sees a growing trend away from expensive storage boxes that can cost more than $100,000 and toward lower-cost servers that are now capable of doing more work. "As servers get more powerful," he says, "they take over some of the work that you used to have specialized appliances do." It's similar to the way networking has evolved from network-attached hubs to a NIC card on the back of the server to functionality residing on silicon as part of the CPU, he adds.

"I believe that storage is moving this way as well," says Csaplar. Instead of buying big expensive storage arrays, he says, companies are taking the JBOD (just a bunch of disks) approach -- using nonintelligent devices for storage and using the compute capacity of the servers to manage it. "This lowers the overall cost of the storage, and you don't really lose any functionality -- or maybe it does 80% of the job at 20% of the cost," he notes.
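On Linux, the JBOD approach Csaplar describes typically means assembling plain disks into a software RAID array, with the server's own CPU doing the parity work a storage appliance would. A minimal sketch (device names and mount point are illustrative, and assume four spare disks):

```shell
# Assemble four plain disks into one software-managed RAID-6 array;
# parity is computed by the server's CPU, not a dedicated controller.
mdadm --create /dev/md0 --level=6 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Put a filesystem on the array and mount it like any local volume.
mkfs.xfs /dev/md0
mount /dev/md0 /srv/storage
```

RAID-6 here is one reasonable choice for large, cheap disks, since it survives two simultaneous drive failures; the same pattern works with other RAID levels.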

Catalist replaced its "$100,000 and up" boxes with four NAS storage units at a cost of $40,000. "We quadrupled our capacity for about $10,000 each," Crigler says. "That was a year and a half ago," and the cost of storage has continued to go down.

Csaplar says he expects to see more lower-end storage systems on the market as more organizations find that they meet their needs. Big vendors like EMC see the writing on the wall and have been buying up smaller, boutique storage companies, he adds.

The Storage and Processing Gap

Data analytics workflow tools are allowing stored data to sit even closer to analytics tools, while their file compression capabilities keep storage needs under control. Vendors such as Hewlett-Packard's Vertica unit, for instance, have in-database analytics functionality that lets companies conduct analytics computations without the need to extract information to a separate environment for processing. EMC's Greenplum unit offers similar features. Both are part of a new generation of columnar databases, which are designed to offer significantly better performance, I/O, storage footprint and efficiency than row-based databases when it comes to analytic workloads. (In April, Greenplum became part of Pivotal Labs, an enterprise platform-as-a-service company that EMC acquired in March.)

Catalist opted for a Vertica database specifically for those features, Crigler says. Because the database is columnar rather than row-based, it looks at the cardinality of the data in the column and can compress it based on that. Cardinality here refers to the number of distinct values a column contains; a low-cardinality column -- a state abbreviation or a party affiliation, say -- repeats the same few values millions of times and therefore compresses extremely well.
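The compression idea is easy to see in miniature. The sketch below (illustrative only, not Vertica's actual codecs) dictionary-encodes a low-cardinality column and then run-length-encodes the result, which is why sorted columnar data takes so little space:

```python
def dictionary_encode(column):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(column))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in column]

def run_length_encode(codes):
    """Collapse runs of repeated codes into [code, run_length] pairs."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1][1] += 1      # extend the current run
        else:
            runs.append([c, 1])   # start a new run
    return runs

# A sorted, low-cardinality column -- e.g. a "state" field over voter rows.
column = ["DC"] * 5 + ["MD"] * 3 + ["VA"] * 4
dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)
# Twelve string values shrink to a 3-entry dictionary plus 3 runs.
```

A row-based store would repeat each string once per record; sorting by the column first, as columnar engines commonly do, is what makes the run-length pass pay off.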
