Component failure in large-scale IT installations is becoming an ever larger problem as the number of components in a single cluster approaches a million. In this paper, we present and analyze field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. About 100,000 disks are covered by this data, some for an entire lifetime of five years. The data include drives with SCSI, FC, and SATA interfaces. The mean time to failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process from what one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that, rather than a significant infant mortality effect, we see a significant early onset of wearout degradation. That is, replacement rates in our data grew steadily with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component-specific factors. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.
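The nominal annual failure rate quoted above can be reproduced with a short calculation. The sketch below assumes the usual datasheet convention of dividing the hours in a year (8,760, assuming continuous operation) by the MTTF; the function name `nominal_afr` and the constant `HOURS_PER_YEAR` are illustrative and do not come from the paper.

```python
# Minimal sketch: convert a datasheet MTTF (in hours) to a nominal annual
# failure rate (AFR). Assumes the drive is powered on all year (8,760 hours)
# and uses the simple ratio hours-per-year / MTTF.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; an assumption of continuous operation


def nominal_afr(mttf_hours: float) -> float:
    """Return the nominal AFR as a fraction (0.01 means 1% per year)."""
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


if __name__ == "__main__":
    for mttf in (1_000_000, 1_500_000):
        print(f"MTTF {mttf:>9,} h -> nominal AFR {nominal_afr(mttf):.2%}")
    # MTTF 1,000,000 h -> nominal AFR 0.88%
    # MTTF 1,500,000 h -> nominal AFR 0.58%
```

Under these assumptions, the lower MTTF bound of 1,000,000 hours gives the "at most 0.88%" figure, which is well below the 2-4% replacement rates the abstract reports as common in the field.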