Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

Authors:
Bianca Schroeder; Garth A. Gibson
Affiliations:
Computer Science Department, Carnegie Mellon University (both authors)
Venue:
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Year:
2007

Abstract

Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI and FC, as well as SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wearout degradation. That is, replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
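
The step from a datasheet MTTF to the "at most 0.88%" nominal annual failure rate can be reproduced with a little arithmetic. The sketch below assumes a constant failure rate and a 8,760-hour year, and divides hours per year by the MTTF. The function names are illustrative and not from the paper. A second helper shows the value under an exponential lifetime model, which is nearly identical at these MTTFs.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def nominal_afr(mttf_hours: float) -> float:
    """Linear approximation: expected fraction of drives failing per year."""
    return HOURS_PER_YEAR / mttf_hours


def exponential_afr(mttf_hours: float) -> float:
    """Probability a drive fails within one year under an exponential model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


if __name__ == "__main__":
    # Datasheet MTTF range quoted in the abstract.
    for mttf in (1_000_000, 1_500_000):
        print(
            f"MTTF {mttf:>9,} h: "
            f"linear {nominal_afr(mttf):.3%}, "
            f"exponential {exponential_afr(mttf):.3%}"
        )
```

An MTTF of 1,000,000 hours gives 8,760 / 1,000,000 = 0.876%, which rounds to the 0.88% upper bound in the abstract. The 1,500,000-hour figure gives 0.584%. Both are well below the 2-4% replacement rates the authors report as common in the field.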