Undetected disk errors in RAID arrays

Authors:
J. L. Hafner;V. Deenadhayalan;W. Belluomini;K. Rao
Affiliations:
IBM Research Division, IBM Almaden Research Center, San Jose, California (all authors)
Venue:
IBM Journal of Research and Development
Year:
2008

Abstract

Though remarkably reliable, disk drives do fail occasionally. Most failures can be detected immediately; moreover, such failures can be modeled and addressed using technologies such as RAID (Redundant Arrays of Independent Disks). Unfortunately, disk drives can also experience errors that the drive itself does not detect, which we refer to as undetected disk errors (UDEs). These errors can cause silent data corruption that may go completely undetected (until a system or application malfunctions) or may be detected by software in the storage I/O stack. Continual increases in disk densities or in storage array sizes and, more significantly, the introduction of desktop-class drives in enterprise storage systems are increasing the likelihood of UDEs in a given system. Therefore, incorporating UDE detection (and correction) into storage systems is necessary to prevent a growing number of data corruption and data loss events. In this paper, we discuss the causes of UDEs and their effects on data integrity. We describe some of the basic techniques that have been applied to address this problem at various software layers in the I/O stack, and we describe a family of solutions that can be integrated into the RAID subsystem.