Are disks the dominant contributor for storage failures?: a comprehensive study of storage subsystem failure characteristics

Authors:
Weihang Jiang;Chongfeng Hu;Yuanyuan Zhou;Arkady Kanevsky
Affiliations:
Department of Computer Science, University of Illinois at Urbana Champaign;Department of Computer Science, University of Illinois at Urbana Champaign;Department of Computer Science, University of Illinois at Urbana Champaign;Network Appliance, Inc.
Venue:
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Year:
2008


Abstract

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucial for designing and building a reliable storage system. While several recent studies have examined storage failures, almost all of them focus on the failure characteristics of one component, disks, and do not study failures of other storage components. This paper analyzes the failure characteristics of storage subsystems. More specifically, we analyzed the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The data set covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. Our study reveals many interesting findings that provide useful guidelines for designing reliable storage systems. Some of our major findings include: (1) In addition to disk failures, which contribute 20-55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlation. In addition, these failures exhibit "bursty" patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than keeping them within a single shelf.