Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics

Authors:
Weihang Jiang; Chongfeng Hu; Yuanyuan Zhou; Arkady Kanevsky
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL; University of Illinois at Urbana-Champaign, Urbana, IL; University of Illinois at Urbana-Champaign, Urbana, IL; Network Appliance, Inc., Sunnyvale, CA
Venue:
ACM Transactions on Storage (TOS)
Year:
2008

Citing 10
Cited 6

A case for redundant arrays of inexpensive disks (RAID)

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
An Analysis of Error Behaviour in a Large Storage System

The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Monitoring hard disks with SMART

Linux Journal
Measuring Real-World Data Availability

LISA '01 Proceedings of the 15th USENIX conference on System administration
Row-Diagonal Parity for Double Disk Failure Correction (awarded Best Paper)

FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
An analysis of latent sector errors in disk drives

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
An analysis of data corruption in the storage stack

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies

Predicting disk failures with HMM- and HSMM-based approaches

ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects
Redundantly grouped cross-object coding for repairable storage

Proceedings of the Asia-Pacific Workshop on Systems
Redundantly grouped cross-object coding for repairable storage

APSys'12 Proceedings of the Third ACM SIGOPS Asia-Pacific conference on Systems
Robustness in the Salus scalable block store

NSDI '13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
Exploiting Redundancies and Deferred Writes to Conserve Energy in Erasure-Coded Storage Clusters

ACM Transactions on Storage (TOS)
When the network crumbles: an empirical study of cloud network failures and their impact on services

Proceedings of the 4th annual Symposium on Cloud Computing

Abstract

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucial for designing and building a reliable storage system. While several recent studies have examined storage failures, almost all of them focus on the failure characteristics of a single component, the disk, and do not study failures of other storage components. This article analyzes the failure characteristics of storage subsystems. More specifically, we analyzed the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The dataset covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage-shelf enclosures. Our study reveals many interesting findings that provide useful guidelines for designing reliable storage systems. Some of our major findings include: (1) In addition to disk failures, which contribute 20-55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for a significant percentage of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlation. In addition, these failures exhibit “bursty” patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than placing them within a single shelf.
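
Finding (4) can be illustrated with a small sketch. The abstract does not explain why spanning shelves helps; the code below assumes one plausible intuition, namely that a shelf-level fault can make every disk in that shelf unavailable at once. The function names, the 8-disk group, and the fault tolerance of 2 are all hypothetical and are not taken from the article.

```python
from collections import Counter


def worst_case_disks_lost(disk_to_shelf):
    """Largest number of the group's disks that share a single shelf.

    This is how many disks the RAID group loses at once if that
    shelf becomes unavailable (illustrative assumption).
    """
    return max(Counter(disk_to_shelf.values()).values())


def survives_any_single_shelf_loss(disk_to_shelf, fault_tolerance):
    """True if losing any one shelf stays within the group's tolerance."""
    return worst_case_disks_lost(disk_to_shelf) <= fault_tolerance


# Hypothetical 8-disk RAID group that tolerates 2 missing disks.
single_shelf = {f"disk{i}": "shelf0" for i in range(8)}
spanned = {f"disk{i}": f"shelf{i // 2}" for i in range(8)}

for name, layout in [("single shelf", single_shelf), ("spanned", spanned)]:
    lost = worst_case_disks_lost(layout)
    ok = survives_any_single_shelf_loss(layout, fault_tolerance=2)
    print(f"{name}: worst case loses {lost} disks, survives={ok}")
```

Under these assumptions, the single-shelf layout loses the whole group to one shelf fault, while the spanned layout loses only as many disks as the group can tolerate. This toy model ignores the failure correlations and burstiness the study reports, so it shows the direction of the effect rather than its measured size.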