The development of the world's largest single-name data repository could help predict weather and prevent overhyping of hurricanes like Irene.
Forecasters had predicted Irene could devastate cities such as Washington and New York, but some of the most severe damage instead occurred much farther inland in states such as Vermont, which was drowned in tropical-storm downpours.
Several post-Hurricane Irene reports singled out inaccurate forecasts as a problem. As the U.K. publication The Guardian wrote, the "storm surge that could have swamped [Manhattan] failed to materialize." And many New Yorkers were unhappy about having prepared for the worst only to experience little to no damage.
Enter IBM's Data Storage Group at Almaden, Calif., which has proved it can build a 120PB data system by using 200,000 SAS (Serial Attached SCSI) drives -- all configured as if they were a single drive under one name. That's roughly 30 times larger than the biggest single data repository on record, according to IBM. The system could store up to 1 trillion files. Even the Wayback Machine, a massive data time capsule created by The Internet Archive to store everything on the Web since 1996, holds only 2PB of data.
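For a sense of scale, the figures above can be cross-checked with some quick arithmetic. The sketch below is only a back-of-the-envelope calculation: it assumes decimal units (1PB = 1,000,000GB), and the per-drive figure is a simple average that ignores redundancy and spare capacity, which IBM did not describe. The "previous record" size is implied by the "roughly 30 times" claim rather than stated by IBM.

```python
# Back-of-the-envelope check of the capacity figures quoted in the article.
# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000

total_pb = 120          # capacity of IBM's planned data store
drive_count = 200_000   # number of SAS drives
wayback_pb = 2          # Wayback Machine capacity cited in the article
size_ratio = 30         # "roughly 30 times larger" than the prior record

# Simple average; ignores redundancy and spare drives (not described by IBM).
avg_gb_per_drive = total_pb * GB_PER_PB / drive_count

# Implied by the "30 times" claim, not a figure IBM stated.
previous_record_pb = total_pb / size_ratio

wayback_multiple = total_pb / wayback_pb

print(f"Average capacity per drive: {avg_gb_per_drive:.0f} GB")
print(f"Implied previous record:    {previous_record_pb:.0f} PB")
print(f"Multiple of Wayback Machine: {wayback_multiple:.0f}x")
```

On those assumptions, each drive contributes about 600GB on average, the previous record works out to roughly 4PB, and the new store would be 60 times the size of the Wayback Machine.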
IBM said it chose high-performance SAS drives over high-capacity SATA drives because the system has high bandwidth requirements. The drives are connected via a backbone that also uses the SAS protocol, but the storage is connected to compute nodes via a proprietary fabric whose details IBM would not disclose.
The technology for IBM's massive data store, which the company plans to begin installing in several customer sites later this year, would be ideal for creating more powerful high-performance computing systems that perform tasks such as climate modeling.
To be sure, Hurricane Irene packed plenty of punch. At least 21 people in eight states died as a result of the storm. And early estimates for damage top $7 billion. But most models showed the storm hitting the East Coast with far more force than it did.
"As with any of these high-performance computing simulations ... the more variables you can look at, the more granular you can be, the better the models. Hopefully, the better the model, the better the prediction," said Bruce Hillsberg, director of Storage Systems Research at IBM. While IBM used the weather simulation as an example, it would not say who its customers were for the data store.
IBM's 120PB data store has yet to be built. The company will be assembling it in the data centers of several customers over the next year, but the base technology for building such systems has been around for many years. The technology, IBM's General Parallel File System (GPFS), is already used in a number of IBM products, including its scale-out NAS (SONAS) array, which IBM brought to market last year and which can scale to 14PB of capacity. IBM also uses GPFS in its strategic archive product, the IBM Information Archive, as well as in its cloud storage service offerings.
IBM has been using GPFS to build massive data stores since 1998. Back then, the largest single virtual drive was 43TB, a capacity that's easily achieved in a single data center rack today.
In fact, IBM's GPFS technology was the data store behind IBM's Watson supercomputer, which earlier this year demonstrated its processing prowess by handily beating champions of the game show Jeopardy! That system boasted a 21.6TB data store.
It was the massive growth in customer data storage requirements that led IBM to build its latest GPFS storage system.