Failure trends in a large disk drive population

Authors:
Eduardo Pinheiro;Wolf-Dietrich Weber;Luiz André Barroso
Affiliations:
Google Inc.;Google Inc.;Google Inc.
Venue:
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Year:
2007

Abstract

It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.