An analysis of data corruption in the storage stack

Authors:
Lakshmi N. Bairavasundaram;Andrea C. Arpaci-Dusseau;Remzi H. Arpaci-Dusseau;Garth R. Goodson;Bianca Schroeder
Affiliations:
University of Wisconsin-Madison, Madison, WI;University of Wisconsin-Madison, Madison, WI;University of Wisconsin-Madison, Madison, WI;NetApp, Sunnyvale, CA;University of Toronto, Toronto, ON
Venue:
ACM Transactions on Storage (TOS)
Year:
2008

Citing 16
Cited 9

A case for redundant arrays of inexpensive disks (RAID)

SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
EVENODD: an optimal scheme for tolerating double disk failures in RAID architectures

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tolerating multiple failures in RAID architectures with optimal storage and uniform declustering

Proceedings of the 24th annual international symposium on Computer architecture
Efficient Placement of Parity and Data to Tolerate Two Disk Failures in Disk Array Systems

IEEE Transactions on Parallel and Distributed Systems
The Google file system

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Commercial Fault Tolerance: A Tale of Two Systems

IEEE Transactions on Dependable and Secure Computing
IRON file systems

Proceedings of the twentieth ACM symposium on Operating systems principles
Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction

FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Awarded Best Student Paper! -- Improving Storage System Availability with D-GRAID

FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Ensuring data integrity in storage: techniques and applications

Proceedings of the 2005 ACM workshop on Storage security and survivability
Matrix methods for lost data reconstruction in erasure codes

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
WEAVER codes: highly fault tolerant erasure codes for storage systems

FAST'05 Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4
An analysis of latent sector errors in disk drives

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population

FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Are disks the dominant contributor for storage failures?: a comprehensive study of storage subsystem failure characteristics

FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies

Lithium: virtual machine storage for the cloud

Proceedings of the 1st ACM symposium on Cloud computing
Just one bit in a million: on the effects of data corruption in files

ECDL'09 Proceedings of the 13th European conference on Research and advanced technology for digital libraries
Scalable virtual machine storage using local disks

ACM SIGOPS Operating Systems Review
Scalable testing of file system checkers

Proceedings of the 7th ACM european conference on Computer Systems
Recon: verifying file system consistency at runtime

FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies
Practical hardening of crash-tolerant systems

USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Recon: Verifying file system consistency at runtime

ACM Transactions on Storage (TOS)
Robustness in the Salus scalable block store

nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
When the network crumbles: an empirical study of cloud network failures and their impact on services

Proceedings of the 4th annual Symposium on Cloud Computing

Quantified Score

Hi-index	0.00

Visualization

Abstract

An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this article, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most. We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: (i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, (ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and (iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.