Uncovering errors: the cost of detecting silent data corruption

  • Authors:
  • Sumit Narayan; John A. Chandy; Samuel Lang; Philip Carns; Robert Ross

  • Affiliations:
  • University of Connecticut, Storrs, CT (Narayan, Chandy); Argonne National Laboratory, Argonne, IL (Lang, Carns, Ross)

  • Venue:
  • Proceedings of the 4th Annual Workshop on Petascale Data Storage
  • Year:
  • 2009


Abstract

Data integrity is pivotal to the usefulness of any storage system. It ensures that the data stored is free from unintended modification throughout its existence on the storage medium. Checks such as cyclic redundancy checks (CRCs) or checksums are frequently used to detect data corruption during transmission to permanent storage or while the data resides there. Without these checks, such errors usually go undetected and unreported by the system and hence are never communicated to the application; they are referred to as "silent data corruption." When an application reads corrupted or malformed data, the result is incorrect output or system failure. Storage arrays in leadership computing facilities comprise many thousands of drives, greatly increasing the likelihood of such failures. These environments mandate a file system capable of detecting data corruption. Parallel file systems have traditionally omitted integrity checks because of their high computational cost, particularly for the unaligned data requests common in scientific applications. In this paper, we assess the cost of providing data integrity on a parallel file system. We present an approach that provides this capability with overhead as low as 5% for aligned writes and 22% for aligned reads, with some additional cost for unaligned requests.
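
For illustration, the sketch below shows the general kind of per-block checksumming the abstract describes. The Python code, the 4 KB block size, and the in-memory ChecksummedStore class are assumptions made here for exposition, not the paper's implementation; the sketch shows where the overhead comes from: every read must rehash the touched blocks, and an unaligned write must rehash whole blocks even though it only modified part of them.

    import zlib

    BLOCK_SIZE = 4096  # assumed checksum granularity; the paper's may differ


    class ChecksummedStore:
        """Illustrative in-memory store with one CRC32 per fixed-size block."""

        def __init__(self, size):
            self.data = bytearray(size)
            # Initial checksums over all-zero blocks.
            self.sums = [zlib.crc32(bytes(BLOCK_SIZE))] * (size // BLOCK_SIZE)

        def write(self, offset, payload):
            """Write payload and recompute checksums for every touched block.

            An unaligned request only partially covers its first and/or
            last block, so the untouched bytes of those blocks must still
            be hashed along with the new bytes: the extra work that makes
            unaligned requests costlier than aligned ones."""
            self.data[offset:offset + len(payload)] = payload
            first = offset // BLOCK_SIZE
            last = (offset + len(payload) - 1) // BLOCK_SIZE
            for b in range(first, last + 1):
                block = bytes(self.data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE])
                self.sums[b] = zlib.crc32(block)

        def read(self, offset, length):
            """Read and verify: a checksum mismatch flags silent corruption
            that would otherwise go unreported to the application."""
            first = offset // BLOCK_SIZE
            last = (offset + length - 1) // BLOCK_SIZE
            for b in range(first, last + 1):
                block = bytes(self.data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE])
                if zlib.crc32(block) != self.sums[b]:
                    raise IOError(f"silent data corruption detected in block {b}")
            return bytes(self.data[offset:offset + length])


    if __name__ == "__main__":
        store = ChecksummedStore(8 * BLOCK_SIZE)
        store.write(100, b"scientific application data")  # unaligned write
        store.read(100, 27)                               # verifies cleanly
        store.data[102] ^= 0xFF                           # simulate a bit flip
        try:
            store.read(100, 27)
        except IOError as e:
            print(e)  # silent data corruption detected in block 0

In a real parallel file system the checksums live on disk rather than in memory, so recomputing a partially covered block additionally requires reading its untouched bytes back from storage (a read-modify-write), which is consistent with the abstract's observation that unaligned requests carry extra cost.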