Computerworld - It's a cruel world out there in the data center. Nothing lasts forever, especially not mechanical devices with fast-moving parts, such as disk drives and printers. It would be very useful if we could predict when something might break or, at the very least, determine which of two similar products would be less likely to break in a given period. The usual measures for this are MTBF, short for mean time between failures, and the closely related MTTF, short for mean time to failure. Both are measures of reliability that are defined statistically as the average number of hours a component, assembly or system will operate before it fails.
MTBF sounds simple: the total time measured divided by the total number of failures observed. For example, let's wring out a new generation of 2.5-in. SCSI enterprise hard drives. We run 15,400 initial units for 1,000 hours each (thus our tests take a little less than six weeks), and we find 11 failures. The MTBF is (15,400 x 1,000) hours/11, or 1.4 million hours. (This is not a hypothetical MTBF; it represents current drive technology in 2005.)
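The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the simple ratio described above; the function name mtbf_hours and its parameters are illustrative and do not come from any standard or library.

```python
def mtbf_hours(units_tested, hours_per_unit, failures):
    """Observed MTBF: total unit-hours of operation divided by failures seen."""
    if failures <= 0:
        # With zero failures the simple ratio is undefined.
        raise ValueError("at least one failure is needed for this estimate")
    return units_tested * hours_per_unit / failures


# Figures from the drive test described above.
drive_mtbf = mtbf_hours(units_tested=15_400, hours_per_unit=1_000, failures=11)
test_weeks = 1_000 / (24 * 7)  # 1,000 hours expressed in weeks

print(f"MTBF: {drive_mtbf:,.0f} hours")           # MTBF: 1,400,000 hours
print(f"Test duration: {test_weeks:.2f} weeks")   # Test duration: 5.95 weeks
```

Note that the result depends only on the total unit-hours and the failure count, which is why a six-week test across many drives can yield a figure far longer than the test itself.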
What does this calculation really mean? An MTBF of 1.4 million hours, determined in six weeks of testing, certainly doesn't say we can expect an individual drive to operate for roughly 160 years before failing. MTBF is a statistical measure, and as such, it can't predict anything for a single unit. We can use that MTBF rating more accurately, however, to describe a population of drives. If we have 1,000 such drives operating continuously in a data center, they accumulate 1,000 drive-hours every hour, so we can expect one to fail about every 1,400 hours, or every 58 days or so, for a total of perhaps 19 failures in three years.
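The fleet figures can be reproduced with a short Python sketch. It assumes a constant failure rate, continuous operation, prompt replacement of failed drives and a year of 8,760 hours; those assumptions and the function names are illustrative and are not stated in this form in the text.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap days ignored (assumption)


def hours_between_fleet_failures(mtbf_hours, fleet_size):
    """Average hours between failures somewhere in a fleet of identical units."""
    return mtbf_hours / fleet_size


def expected_failures(mtbf_hours, fleet_size, years):
    """Expected failure count: total fleet operating hours divided by MTBF."""
    return fleet_size * years * HOURS_PER_YEAR / mtbf_hours


mtbf = 1_400_000   # hours, from the drive test
fleet = 1_000      # drives running continuously

interval_days = hours_between_fleet_failures(mtbf, fleet) / 24
three_year_failures = expected_failures(mtbf, fleet, years=3)
single_drive_years = mtbf / HOURS_PER_YEAR

print(f"One failure about every {interval_days:.1f} days")            # 58.3 days
print(f"Expected failures in three years: {three_year_failures:.1f}") # 18.8
print(f"MTBF expressed in years: {single_drive_years:.1f}")           # 159.8
```

The last figure is the one that should not be read as a lifetime prediction for any single drive; the first two are the population-level estimates the MTBF rating actually supports.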
The MTBF figure for a product can be derived from laboratory testing, actual field failure data or prediction models such as MIL-HDBK-217 (the Military Handbook for Reliability Prediction of Electronic Equipment, published by the U.S. Department of Defense).
MIL-HDBK-217 contains failure-rate models for various parts used in electronic systems, such as integrated circuits, transistors, diodes, resistors, capacitors, relays, switches and connectors. These failure-rate models are based on a large amount of field data that was analyzed and simplified by the Reliability Analysis Center and Rome Laboratory at Griffiss Air Force Base in Rome, N.Y. (Instructions for downloading MIL-HDBK-217 are at www.t-cubed.com/faq_217.htm.)
Kay is a Computerworld contributing writer in Worcester, Mass. You can contact him at email@example.com.