Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

Authors:
Bianca Schroeder; Garth A. Gibson
Affiliations:
Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
Venue:
ACM Transactions on Storage (TOS)
Year:
2007

References (19):
Measurement and modeling of computer reliability as affected by system activity
    ACM Transactions on Computer Systems (TOCS)
A case for redundant arrays of inexpensive disks (RAID)
    SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Redundant disk arrays: reliable, parallel secondary storage
On the self-similar nature of Ethernet traffic (extended version)
    IEEE/ACM Transactions on Networking (TON)
Improving cluster availability using workstation validation
    SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Networked Windows NT System Field Failure Data Analysis
    PRDC '99 Proceedings of the 1999 Pacific Rim International Symposium on Dependable Computing
Failure Data Analysis of a LAN of Windows NT Based Computers
    SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
The Google file system
    SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
    DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Long-Range Dependence: Ten Years of Internet Traffic Modeling
    IEEE Internet Computing
IRON file systems
    Proceedings of the twentieth ACM symposium on Operating systems principles
Row-Diagonal Parity for Double Disk Failure Correction
    FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
A large-scale study of failures in high-performance computing systems
    DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Introduction to Probability Models, Ninth Edition
Why do internet services fail, and what can be done about it?
    USITS'03 Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4
An analysis of latent sector errors in disk drives
    Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
    FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
    FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Modeling machine availability in enterprise and wide-area distributed computing environments
    Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Cited by (12):
Idle read after write: IRAW
    ATC'08 USENIX 2008 Annual Technical Conference
Higher reliability redundant disk arrays: Organization, operation, and coding
    ACM Transactions on Storage (TOS)
Reliability analysis of deduplicated and erasure-coded storage
    ACM SIGMETRICS Performance Evaluation Review
Reliability prediction for fault-tolerant software architectures
    Proceedings of the joint ACM SIGSOFT conference -- QoSA and ACM SIGSOFT symposium -- ISARCS on Quality of software architectures -- QoSA and architecting critical systems -- ISARCS
Survey and analysis of disk scheduling methods
    ACM SIGARCH Computer Architecture News
Modeling and tolerating heterogeneous failures in large parallel systems
    Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Rebuild processing in RAID5 with emphasis on the supplementary parity augmentation method[37]
    ACM SIGARCH Computer Architecture News
Performance, reliability, and performability of a hybrid RAID array and a comparison with traditional RAID1 arrays
    Cluster Computing
Hierarchical RAID: Design, performance, reliability, and recovery
    Journal of Parallel and Distributed Computing
Themis: an I/O-efficient MapReduce
    Proceedings of the Third ACM Symposium on Cloud Computing
Using dark fiber to displace diesel generators
    HotOS'13 Proceedings of the 14th USENIX conference on Hot Topics in Operating Systems
Anomaly detection of cooling fan and fault classification of induction motor using Mahalanobis-Taguchi system
    Expert Systems with Applications: An International Journal

Abstract

Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million. This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007] and presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. More than 110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. In other words, the replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
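
The "at most 0.88%" figure in the abstract can be reproduced with a short calculation. The sketch below assumes the nominal annual failure rate is simply the number of hours in a year divided by the datasheet MTTF (8,760 hours per year, ignoring leap years); the function names are illustrative and do not come from the article. It also shows how far the field replacement rates quoted in the abstract sit above that nominal value.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def nominal_afr(mttf_hours: float) -> float:
    """Nominal annual failure rate implied by a datasheet MTTF.

    Assumes the simple ratio hours-per-year / MTTF, returned as a fraction.
    """
    return HOURS_PER_YEAR / mttf_hours


def ratio_to_nominal(observed_rate: float, mttf_hours: float) -> float:
    """How many times larger an observed annual rate is than the nominal one."""
    return observed_rate / nominal_afr(mttf_hours)


if __name__ == "__main__":
    # Datasheet MTTF range quoted in the abstract.
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")

    # Field replacement rates quoted in the abstract, compared against
    # the most pessimistic datasheet value (MTTF = 1,000,000 hours).
    for observed in (0.02, 0.04, 0.13):
        factor = ratio_to_nominal(observed, 1_000_000)
        print(f"observed {observed:.0%} is {factor:.1f}x the nominal rate")
```

With an MTTF of 1,000,000 hours the ratio gives 0.876%, which rounds to the 0.88% upper bound stated above; the longer 1,500,000-hour MTTF implies a lower nominal rate.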