Vendor disk failure rates: Myth or metric?

Disk problems contribute to 20% to 55% of storage subsystem failures

The statistics of mean time between failures (MTBF) and annualized failure rate (AFR) have gotten lots of attention lately in the storage world, especially with the release in the last year of three much-discussed studies devoted to the topic. And for good reason: Vendor-stated MTBFs have risen into the 1 million-to-1.5 million-hour range, the equivalent of roughly 114 to 171 years, a lifespan that no one is seeing in the real world.

The three studies on MTBF released over the past year come from Carnegie Mellon University, Google and the University of Illinois.

"MTBF is a term that's in growing disrepute inside the industry because people don't understand what the numbers mean," says Robin Harris, an analyst at Data Mobility Group who also runs the StorageMojo blog. "Your average consumer and a lot of server administrators don't really get why vendors say a disk has a 1 million-hour MTBF, and yet it doesn't last that long."

Indeed, "how do these numbers help a person who wants to evaluate drives?" says Steve Smith, a former EMC Corp. employee and an independent management consultant in Bellevue, Wash. "I don't think they can."

Even storage system maker NetApp Inc. acknowledges in a response to an open letter on the StorageMojo blog that failure rates are several times higher than reported. "Most experienced storage array customers have learned to equate the accuracy of quoted drive-failure specs to the miles-per-gallon estimates reported by car manufacturers," the company says. "It's a classic case of 'Your mileage may vary' -- and often will -- if you deploy these disks in anything but the mildest of evaluation/demo lab environments."

Study results

The upshot of the recent studies can be summarized this way: Users and vendors live in very different worlds when it comes to disk reliability and failure rates.

Consider that MTBF is a figure that's reached through stress-testing and statistical extrapolation, Harris says. "When the vendor specs a 300,000-hour MTBF -- which is common for consumer-level SATA drives -- they're saying that for a large population of drives, half will fail in the first 300,000 hours of operation," he says on his blog. "MTBF, therefore, says nothing about how long any particular drive will last." In other words, MTBF does a very poor job communicating what the actual failure profile looks like, he says.
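
To see why a population-level MTBF says little about any single drive, it can help to turn the figure into an expected share of drives failing within a realistic service life. The sketch below is illustrative only: it assumes a constant failure rate (the exponential model often paired with MTBF figures) and round-the-clock operation, neither of which comes from the studies, and the function name is made up for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours * 365 days, assuming continuous operation


def fraction_failed(mtbf_hours: float, years: float) -> float:
    """Expected fraction of a large drive population that has failed after
    `years` of operation, ASSUMING a constant failure rate (exponential model)."""
    hours = years * HOURS_PER_YEAR
    return 1.0 - math.exp(-hours / mtbf_hours)


if __name__ == "__main__":
    # 300,000 hours is the consumer SATA figure quoted in the article.
    for years in (1, 3, 5):
        share = fraction_failed(300_000, years)
        print(f"after {years} year(s): {share:.1%} of drives expected to have failed")
```

Under those assumptions the model describes a fleet, not a drive: it gives a share of the population expected to fail in a given window and nothing about which units those will be.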

It's like providing the average woman's height in the U.S. but without showing the numbers used to derive that average, Smith says. "MTBF became the standard because it was perceived as a simpler answer to the question of reliability than showing the data of how they arrived at it," Smith says. "It's an honest-to-God simplification."

Stan Zaffos, an analyst at Gartner Inc., agrees. While he believes MTBF is an accurate representation of what the vendors are experiencing with the technology they're shipping, it's also difficult to translate into something meaningful to end users. "It's a very complex and tortuous route to undertake, requiring a lot of solid engineering experience and an understanding of probability and statistics," he says.

According to Harris, the industry has tried to be less misleading by using AFR instead of MTBF. "People want to know, in a given year, what percentage of drives they can expect to fail," says Bianca Schroeder, a co-author of the Carnegie Mellon study.

However, according to the study, the rate of disk replacements is far higher than the AFR percentages provided by vendors. While vendors' data sheets show AFRs between 0.58% and 0.88%, the study found average replacement rates typically exceeding 1%, with 2% to 4% common and up to 13% observed on some systems. The study gathered the disk-replacement data of a number of large production systems, for a total of 100,000 SCSI, Fibre Channel and SATA disks.
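
The data-sheet AFRs line up with a simple conversion from the vendor MTBFs cited earlier: hours in a year divided by MTBF. The sketch below uses that approximation, which is reasonable only when the resulting rate is small, and then compares expected annual failures for a fleet of 1,000 drives, a size picked purely for illustration. The function names are hypothetical, and the 2% and 4% rates are the replacement rates the study called common.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days, assuming continuous operation


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualized failure rate implied by an MTBF figure.
    Simple approximation (hours per year / MTBF); fine for small rates."""
    return HOURS_PER_YEAR / mtbf_hours


def expected_annual_failures(n_drives: int, annual_rate: float) -> float:
    """Expected number of drives failing (or being replaced) in one year."""
    return n_drives * annual_rate


if __name__ == "__main__":
    fleet = 1_000  # hypothetical fleet size, for illustration only

    for mtbf in (1_000_000, 1_500_000):
        afr = afr_from_mtbf(mtbf)
        print(f"MTBF {mtbf:,} h -> AFR {afr:.2%}, "
              f"about {expected_annual_failures(fleet, afr):.1f} drives/year")

    for observed in (0.02, 0.04):  # replacement rates reported as common
        print(f"observed {observed:.0%} -> "
              f"about {expected_annual_failures(fleet, observed):.0f} drives/year")
```

For the same fleet, the data-sheet figures imply a handful of failures a year, while the observed replacement rates imply dozens.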

The study also found that replacement rates grew constantly with age, which counters the common understanding that drive degradation sets in after a nominal lifetime of five years, Schroeder says.

One explanation for this is that the study looked at how many drives were replaced, while AFR describes how many disks are expected to fail, and as Schroeder explains, "failure and replacement are not always the same thing." For one thing, users might proactively replace a drive that's just starting to act strange but is actually still functional.

A second factor lies in the vendors' testing environments, which are optimal compared with actual user environments, Schroeder says. Zaffos points out that there are lots of transient events that could cause a failure in a user's environment, including vibration, power surges, dust and humidity levels.

A third disconnect lies in the definition of a failure. "Vendors define failure differently than you and I do, and their definition makes drives look more reliable than what you and I see," Harris says on his blog. That's because when disk drive vendors get a drive returned to them marked "bad," they plug it into their test equipment, and if they find no problem with it, they dub it "no trouble found," or NTF, increasing the reliability measure of returned disks. In fact, vendors typically report "no trouble found" with 20% to 30% of all returned drives, he says. But, Harris says, you can take that same disk and plug it back into the user's server, and it won't work. Of course, to the user, it's still a bad disk that needs to be replaced. In fact, both versions of the truth can be valid at the same time.

As the Google study found, disk failures are sometimes the result of a combination of components, like a particular drive with a particular controller or cable. "A good number of drives could be still considered operational in a different test harness," the study says. "We have observed situations where a drive tester consistently 'green lights' a unit that invariably fails in the field."

The University of Illinois study verifies that finding. Although disks contribute to 20% to 55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages -- 27% to 68% for physical interconnects and 5% to 10% for protocol stacks. The study used real-world field data from NetApp, analyzing the error logs collected from about 39,000 commercially deployed storage systems. The data set included 1.8 million disks hosted in about 155,000 storage-shelf enclosures.

This finding is important, the study's authors say, because it will lead the industry to consider other factors when designing reliable storage systems. Such factors include selecting more reliable disk models and shelf enclosures, as well as employing redundancy mechanisms to tolerate component failures, like multipathing, or configuring the storage subsystem with two independent interconnects rather than a single interconnect.

Other statistics

Yet another reliability statistic that's bandied about is mean time to data loss (MTDL), which is a measurement derived from MTBF. MTDL is used by vendors of storage subsystems, not drive manufacturers, and it takes into account the number of disks involved, resiliency, rebuild time and the amount and type of redundancy offered.

This is a useful number, Harris says, but it's still based more on theory than actuality. In fact, he says, the University of Illinois study calls one of MTDL's tenets into question. It found that each type of storage subsystem failure exhibits strong correlations; that is, after one failure, the probability of additional failures of the same type is higher, and the failures are likely to happen relatively close together in time.

"Most of the theoretical numbers that people use for mean time to data loss are based on the idea that failures are random, but they aren't," Harris says. "The failures are fairly highly correlated, so the theoretical calculation doesn't match what's observed in the field."

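For a sense of how much the independence assumption carries, here is a common textbook approximation of mean time to data loss for a single-parity disk group, where data is lost if a second disk fails while the first is still rebuilding. It is a sketch, not any vendor's actual method: the formula assumes independent failures at a constant rate, and the group size and rebuild time are hypothetical values chosen for illustration.

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days


def mtdl_single_parity(mtbf_hours: float, n_disks: int, rebuild_hours: float) -> float:
    """Textbook mean time to data loss, in hours, for a single-parity group.

    ASSUMES failures are independent and occur at a constant rate, which is
    the assumption the field data calls into question.
    """
    if n_disks < 2:
        raise ValueError("a parity group needs at least two disks")
    return mtbf_hours ** 2 / (n_disks * (n_disks - 1) * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical group: 8 disks, 1 million-hour MTBF, 24-hour rebuild.
    hours = mtdl_single_parity(1_000_000, n_disks=8, rebuild_hours=24)
    print(f"theoretical MTDL: {hours / HOURS_PER_YEAR:,.0f} years")
```

Because MTBF is squared, the result is enormous on paper. That squared term is a direct product of the independence assumption, so correlated failures that cluster in time undercut it.
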
All those numbers aside

Perhaps the issue that the industry is having the most difficulty communicating, Harris says, is that disk drives are mechanical devices, and as such, they wear out. In fact, looking at the Google and Carnegie Mellon studies, once a drive reaches three years of age, its AFR starts rising, he says.

"This is something people with big disk farms have known intuitively for years," Harris says. "If you're just buying a couple hundred disk drives, the number that's important to you is that after about three years, you should be thinking about replacing your drives," depending on how risk-averse you are, how valuable your data is and how good your backups are.

The trouble is, vendor marketing teams need to figure out how to convey that. "It's not an easy problem from a perception standpoint," Harris says.

The three-year mark closely correlates with typical disk warranties, Smith points out. "Although MTBF is supposed to be 136 years, that's way past the warranty on these drives," he says.

Some say the only reason the measurement hasn't been ditched yet is that there's no good alternative. "Should we throw out MTBF? Well, what are we going to replace it with?" Smith asks. There's got to be some way, he says, to show growth in reliability and to distinguish among drives intended to be more reliable than others.

"I think most people would like to have a different measure, but it's hard to come up with what a better metric should be," Schroeder notes.

Smith claims that vendors do have unambiguous MTBF numbers, but they choose not to share them. "The people that know the most about this are the biggest disk array manufacturers," Smith says. "But I don't think they're going to be forthright with giving people that data because it would reduce the opportunity for them to add value by 'interpreting' the number."

There's also the idea of collecting and using field data. But even that's problematic, Schroeder says. For one thing, the study showed disk failures increasing significantly as the disk aged. So data gathered from disks that are, say, one year old, would not apply to others that were a different age, she points out.

"Maybe you'd need an AFR for each year that the drive is in use, but even so, there are so many other factors, such as operating conditions and workload," Schroeder says. "There are so many factors that impact drive reliability that it would be hard to come up with a realistic model." However, she agrees that it would be helpful for vendors to provide more data, such as field replacement rates and number of latent sector errors. Carnegie Mellon is working with Usenix to create a failure data repository for drives of various types, ages and capacities.

What affects customers even more than disk reliability, Smith says, is batches of bad drives, especially in light of the Google study's finding that one drive failure highly correlates to other failures.

"That's why it's so important that vendors put an infrastructure in place that allows them to do physical analysis of the installed base to find out if there's a systemic problem -- a microcode bug or a bad batch of components," Zaffos says. "That's what makes it important to have a mature service/support organization that is able to track history and look for patterns."

Is MTBF relevant?

Have we reached a point where the disk drives are so reliable that we don't need to concern ourselves with numbers like MTBF or AFR? "I don't know the answer," Smith says. But he acknowledges that he himself doesn't look carefully at MTBF on drives before he buys them.

"I believe down to my bone marrow that the MTBF on these drives is so high that I don't have to worry about it much," Smith says. "Do you make a distinction between someone's numbers that are 10% different? What's the difference between a million hours and 1.5 million?"

At the same time, the more the industry learns about what makes one storage system more reliable than another, the better these systems can be architected. Reliability has greatly improved since the days when vendors really did have to prove that their disks were trustworthy, but the degree to which businesses rely on these components has also increased exponentially. While a 25,000-hour MTBF was seen as pretty good 25 years ago, compared with today's million-hour numbers, "the trouble seems to be growing faster than their reliability," Harris says.
