A case for redundant arrays of inexpensive disks (RAID)
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Data integrity analysis of disk array systems with analytic modeling of coverage
Performance Evaluation - Special issue: 6th international conference on modelling techniques and tools for computer performance evaluation
Disk Scrubbing in Large Archival Storage Systems
MASCOTS '04 Proceedings of the The IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Proceedings of the twentieth ACM symposium on Operating systems principles
Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction
FAST '04 Proceedings of the 3rd USENIX Conference on File and Storage Technologies
Ensuring data integrity in storage: techniques and applications
Proceedings of the 2005 ACM workshop on Storage security and survivability
An analysis of latent sector errors in disk drives
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
Failure trends in a large disk drive population
FAST '07 Proceedings of the 5th USENIX conference on File and Storage Technologies
An analysis of data corruption in the storage stack
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Parity lost and parity regained
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Higher reliability redundant disk arrays: Organization, operation, and coding
ACM Transactions on Storage (TOS)
Keeping bits safe: how hard can it be?
Communications of the ACM
Keeping Bits Safe: How Hard Can It Be?
Queue - Storage
ACM Transactions on Storage (TOS)
Building intelligence for software defined data centers: modeling usage patterns
Proceedings of the 6th International Systems and Storage Conference
Sector-Disk (SD) Erasure Codes for Mixed Failure Modes in RAID Systems
ACM Transactions on Storage (TOS)
SD codes: erasure codes designed for how storage systems really fail
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Hi-index | 0.02 |
Though remarkably reliable, disk drives do fail occasionally. Most failures can be detected immediately; moreover, such, failures can be modeled and addressed using technologies such as RAID (Redundant Arrays of Independent Disks). Unfortunately, disk drives can experience errors that are undetected by the drive-- which we refer to as undetected disk errors (UDEs). These errors can cause silent data corruption that may go completely undetected (until a system or application malfunction) or may be detected by software in the storage I/O stack. Continual increases in disk densities or in storage array sizes and more significantly the introduction of desktop-class drives in enterprise storage systems are increasing the likelihood of UDEs in a given system. Therefore, the incorporation of UDE detection (and correction) into storage systems is necessary to prevent increasing numbers of data corruption and data loss events. In this paper, we discuss the causes of UDEs and their effects on data integrity. We describe some of the basic techniques that have been applied to address this problem at various software layers in the I/O stack and describe a family of solutions that can be integrated into the RAID subsystem.