Hard Data
Computerworld -
No theory is ever as good as lots of real-world data. So here, based on lots of real-world data, is what you should do to minimize problems with hard disk drives: a) burn them in rigorously; b) replace them as soon as they start throwing errors, especially scan errors; and c) retire them before they turn three years old. Oh, and d) remember that none of those measures is a substitute for regular backups.
That’s the gist of a pair of amazing studies presented at the FAST ’07 storage conference this month. Two separate research groups each collected data on 100,000 disk drives, some of which failed — then they crunched the numbers to identify how the drives failed, what they (mainly) failed from and what they (mostly) didn’t fail from.
And ho boy, do they ever fail. Hard drives are the most commonly replaced hardware item in many data centers, and they account for 16% of all hardware-related outages. Anything that tells us how to keep them from dropping dead is money in the bank for IT shops.
One of the studies, from Carnegie Mellon University, got its statistics from a wide range of sites, including the Los Alamos National Laboratory, the Pittsburgh Supercomputing Center and various Internet service providers. (You can find that study online at www.usenix.org/events/fast07/tech/schroeder.html.)
The other study sifted through data from Google’s automated system for tracking performance of drives in its own huge storage farms. That one’s at http://labs.google.com/papers/disk_failures.pdf.
If those two populations sound very much alike — well, listen harder. High-performance computing centers tend to buy gear with high-performance specs. Google, on the other hand, is notoriously cheap when it comes to hardware — it buys garden-variety hard drives in large lots from whoever is offering the best deal that particular week.
But it turns out that high-end and consumer drives have a lot in common. For one thing, they typically don’t last the five years that drive vendors say they should, at least not in server-farm settings. Drive failures at Google take a big jump once drives get to be more than two years old. And according to the Carnegie Mellon team, those rising failure rates never level off — they just keep going up as drives get older.
Think using a drive a lot will make it much more likely to fail? Nope, say the guys from Google. Low-utilization drives fail at almost exactly the same rate as high-utilization drives.
Think RAID is a guarantee against a storage catastrophe? Don’t believe it, say the Carnegie Mellon folks. According to their real-world data, in RAID 5 arrays, when one drive fails, another drive failure will often happen much sooner than it theoretically should — maybe even before you’ve replaced the bad drive and rebuilt the data set on the RAID array.
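For a sense of what "theoretically should" means here, below is a minimal sketch of the usual textbook baseline: drives fail independently at a constant rate, so the chance of a second failure during the rebuild window follows from an exponential lifetime model. The function name and every number in it (seven surviving drives, a 1,000,000-hour MTTF, a 24-hour rebuild) are illustrative assumptions, not figures from either study.

```python
import math


def prob_second_failure(n_remaining: int, mttf_hours: float, rebuild_hours: float) -> float:
    """Chance that at least one surviving drive fails before the rebuild finishes.

    Assumes independent drives with exponentially distributed lifetimes,
    which is the idealized model, not what the field data shows.
    """
    # Combined failure rate of all surviving drives, in failures per hour.
    combined_rate = n_remaining / mttf_hours
    # P(at least one failure in the window) = 1 - P(no failures in the window).
    return 1.0 - math.exp(-combined_rate * rebuild_hours)


# Hypothetical RAID 5 array: 8 drives, one already dead, 7 still running.
p = prob_second_failure(n_remaining=7, mttf_hours=1_000_000, rebuild_hours=24)
print(f"Idealized chance of a second failure during rebuild: {p:.4%}")
```

With those assumed inputs, the idealized answer is a small fraction of a percent. The Carnegie Mellon point is that real arrays do worse than this kind of estimate, so treat the number as a floor, not a forecast.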