Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucial for designing and building a reliable storage system. While several recent studies have examined storage failures, almost all of them focus on the failure characteristics of a single component, the disk, and do not study failures of other storage components. This paper analyzes the failure characteristics of storage subsystems. More specifically, we analyze the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The data set covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. Our study reveals many interesting findings that provide useful guidelines for designing reliable storage systems. Some of our major findings include: (1) In addition to disk failures, which contribute 20-55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlation. In addition, these failures exhibit "bursty" patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than keeping them within a single shelf.
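The intuition behind finding (4) can be illustrated with a toy model. The sketch below is not the paper's methodology: it assumes, for illustration only, that a shelf-level fault makes every disk in that shelf unavailable at once, and that a RAID group survives as long as the number of unavailable members does not exceed its parity count. The function names and the round-robin placement policy are hypothetical.

```python
from collections import Counter


def assign_round_robin(num_disks, num_shelves):
    """Hypothetical layout policy: disk i of the RAID group goes to shelf i % num_shelves."""
    return [i % num_shelves for i in range(num_disks)]


def worst_case_loss(layout):
    """Largest number of group members that become unavailable if any one shelf fails."""
    return max(Counter(layout).values())


def survives_shelf_failure(layout, parity_disks):
    """Assumption: the group tolerates at most `parity_disks` unavailable members."""
    return worst_case_loss(layout) <= parity_disks


# An 8-disk group kept in one shelf versus spread over 4 shelves.
single_shelf = assign_round_robin(8, 1)
spanned = assign_round_robin(8, 4)

print(worst_case_loss(single_shelf), survives_shelf_failure(single_shelf, 2))
print(worst_case_loss(spanned), survives_shelf_failure(spanned, 2))
```

Under these assumptions, a double-parity group confined to one shelf loses all eight members to a single shelf fault, while the same group spread over four shelves loses at most two and remains within its parity budget.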