IBM's 120 petabyte drive could help better predict weather

Massive drive would store up to 1 trillion files

"We really think that cloud computing and cloud storage could get to these capacity points in coming years. So this research allows us to be ready for that when the market needs it," Hillsberg said.

Challenges with scale-out

While GPFS has been around for years, building a 120PB drive had its challenges, the greatest of which was data integrity, Hillsberg said.

"With 200,000 drives, there are going to be drives failing all the time. So you have to think about it not in terms of trying to improve the failure rates of individual drives, but look at the system as a whole and meantime to data loss," he said, referring to referring to how long the data store will last before it might begin losing information. "So how do you keep the system up and running when you have lots and lots of individual components failing?"

Hillsberg and his team looked at current technologies, such as RAID 6, or dual-drive parity, which offered a mean time to data loss of about 56 years, but the risk of losing data was still too high.
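For context, the sketch below uses the standard textbook approximation for the mean time to data loss (MTTDL) of a dual-parity group, where data is lost only if a third drive fails while two earlier failures are still being rebuilt. The drive MTTF, rebuild time and group size are assumed values for illustration; they are not the parameters behind IBM's 56-year estimate, and different assumptions will produce a very different number.

```python
# Textbook MTTDL approximation for one RAID 6 (dual-parity) group:
#   MTTDL_group ~ MTTF^3 / (N * (N-1) * (N-2) * MTTR^2)
# All parameter values below are assumptions for illustration only.
MTTF_HOURS = 1_000_000   # assumed mean time to failure per drive
REBUILD_HOURS = 24       # assumed time to rebuild one failed drive
GROUP_SIZE = 12          # assumed drives per RAID 6 group
TOTAL_DRIVES = 200_000

group_mttdl = MTTF_HOURS**3 / (
    GROUP_SIZE * (GROUP_SIZE - 1) * (GROUP_SIZE - 2) * REBUILD_HOURS**2
)

# With many independent groups, the first loss anywhere arrives much sooner.
num_groups = TOTAL_DRIVES // GROUP_SIZE
system_mttdl_years = group_mttdl / num_groups / (24 * 365)
print(f"Approximate system MTTDL under these assumptions: "
      f"{system_mttdl_years:,.0f} years")
```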

Without giving specifics on the "secret sauce," Hillsberg said his team came up with another scheme that offers a mean time of up to 1 million years between data loss events.

"It has to do with keeping more copies of data than you would in traditional RAID systems as well as algorithms to recover it. We also have a lot of optimization in there to deal with the rate of recovery to keep it efficient," he said.

Supercomputers are typically limited by the rate at which they can pull data from the storage subsystem. A RAID rebuild can consume an enormous amount of CPU capacity, dragging down the performance of the whole system.

IBM created an algorithm that examines disk failure rates and rebuilds data at different rates depending on how many drives have failed and how many copies of the data are available. For example, the system reacts more slowly and uses fewer CPU cycles to rebuild after a single disk failure than after multiple failures.

"If you're seeing one failure with one set of data, you can do that rebuild relatively slowly because we have the data redundancy," he explained. "And, if you have two failures in the same data space, you go faster. If you have three failures, then you go really fast," he said.

Another data center issue the GPFS and data resiliency technology addresses is one affecting the network-attached storage market as a whole: one NAS file server is easy to manage, but 100 NAS arrays aren't.

"We've learned how to build systems that scale out in terms of performance and capacity, but we've been able to keep the management costs flat," Hillsburg said. "We do that through a single name space and single point of management."

Lucas Mearian covers storage, disaster recovery and business continuity, financial services infrastructure and health care IT for Computerworld. Follow Lucas on Twitter at @lucasmearian or subscribe to Lucas's RSS feed. His e-mail address is lmearian@computerworld.com.

Copyright © 2011 IDG Communications, Inc.
