It is estimated that over 90% of all new information produced in the world is being stored on magnetic media, most of it on hard disk drives. Despite their importance, there is relatively little published work on the failure patterns of disk drives and the key factors that affect their lifetime. Most available data are based either on extrapolation from accelerated aging experiments or on relatively modest-sized field studies. Moreover, larger population studies rarely have the infrastructure in place to collect health signals from components in operation, which is critical information for detailed failure analysis. We present data collected from detailed observations of a large disk drive population in a production Internet services deployment. The population observed is many times larger than that of previous studies. In addition to presenting failure statistics, we analyze the correlation between failures and several parameters generally believed to impact longevity. Our analysis identifies several parameters from the drive's self-monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.
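The distinction between a signal that correlates highly with failures and a signal that predicts individual failures can be made concrete with a small sketch. The counts below are invented purely for illustration and are not taken from the study; the names `Counts`, `relative_risk`, `recall`, and `precision` are likewise hypothetical helpers. The sketch assumes a single binary SMART-derived flag per drive and shows how the flag can raise failure risk many times over while still missing most failures.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """2x2 table for one binary SMART-derived flag (hypothetical data)."""
    flagged_failed: int    # drives that raised the flag and later failed
    flagged_ok: int        # drives that raised the flag and survived
    unflagged_failed: int  # drives that failed without raising the flag
    unflagged_ok: int      # drives that never raised the flag and survived


def relative_risk(c: Counts) -> float:
    """Failure rate among flagged drives divided by the rate among unflagged drives."""
    risk_flagged = c.flagged_failed / (c.flagged_failed + c.flagged_ok)
    risk_unflagged = c.unflagged_failed / (c.unflagged_failed + c.unflagged_ok)
    return risk_flagged / risk_unflagged


def recall(c: Counts) -> float:
    """Fraction of all failed drives that raised the flag beforehand."""
    return c.flagged_failed / (c.flagged_failed + c.unflagged_failed)


def precision(c: Counts) -> float:
    """Fraction of flagged drives that actually failed."""
    return c.flagged_failed / (c.flagged_failed + c.flagged_ok)


# Invented counts for a population of 10,000 drives (assumption, not study data).
example = Counts(flagged_failed=40, flagged_ok=160,
                 unflagged_failed=60, unflagged_ok=9740)

if __name__ == "__main__":
    print(f"relative risk: {relative_risk(example):.1f}")
    print(f"recall:        {recall(example):.2f}")
    print(f"precision:     {precision(example):.2f}")
```

With these made-up counts, flagged drives fail roughly 33 times as often as unflagged ones, yet the flag precedes only 40% of failures and 80% of flagged drives survive. A population-level correlation of this kind is therefore compatible with weak per-drive prediction, which is the shape of the conclusion stated in the abstract.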