A new intra-disk redundancy scheme for high-reliability RAID storage systems in the presence of unrecoverable errors

Authors:
Ajay Dholakia;Evangelos Eleftheriou;Xiao-Yu Hu;Ilias Iliadis;Jai Menon;K.K. Rao
Affiliations:
IBM Systems and Technology Group, Research Triangle Park, NC;IBM Zurich Research Laboratory, Rüschlikon, Switzerland;IBM Zurich Research Laboratory, Rüschlikon, Switzerland;IBM Zurich Research Laboratory, Rüschlikon, Switzerland;IBM Systems and Technology Group, San Jose, CA;IBM Almaden Research Center, San Jose, CA
Venue:
ACM Transactions on Storage (TOS)
Year:
2008

Abstract

Today's data storage systems are increasingly adopting low-cost disk drives that have higher capacity but lower reliability, leading to more frequent rebuilds and to a higher risk of unrecoverable media errors. We propose an efficient intradisk redundancy scheme to enhance the reliability of RAID systems. This scheme introduces an additional level of redundancy inside each disk, on top of the RAID redundancy across multiple disks. The RAID parity provides protection against disk failures, whereas the proposed scheme aims to protect against media-related unrecoverable errors. In particular, we consider an intradisk redundancy architecture that is based on an interleaved parity-check coding scheme, which incurs only negligible I/O performance degradation. A comparison between this coding scheme and schemes based on traditional Reed-Solomon codes and single-parity-check codes is conducted by analytical means. A new model is developed to capture the effect of correlated unrecoverable sector errors. The probability of an unrecoverable failure associated with these schemes is derived for the new correlated model, as well as for the simpler independent error model. We also derive closed-form expressions for the mean time to data loss of RAID-5 and RAID-6 systems in the presence of unrecoverable errors and disk failures. We then combine these results to characterize the reliability of RAID systems that incorporate the intradisk redundancy scheme. Our results show that in the practical case of correlated errors, the interleaved parity-check scheme provides the same reliability as the optimum, albeit more complex, Reed-Solomon coding scheme. Finally, the I/O and throughput performances are evaluated by means of analysis and event-driven simulation.
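
The abstract names an interleaved parity-check code but does not spell out its layout. The following is a minimal Python sketch of one common reading of that idea, not the authors' implementation: a segment of data sectors is split into m interleaves (sector i belongs to interleave i mod m), and one XOR parity sector is kept per interleave. The function names, the interleaving depth `m`, the use of `None` to mark an unreadable sector, and the assumption that the parity sectors themselves remain readable are all illustrative choices made here.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


def ipc_parity(data, m):
    """Compute m parity sectors; parity j covers data sectors i with i % m == j."""
    size = len(data[0])
    parity = [bytes(size)] * m  # all-zero sectors to start
    for i, sector in enumerate(data):
        parity[i % m] = xor_bytes(parity[i % m], sector)
    return parity


def ipc_recover(data, parity, m):
    """Rebuild sectors marked None (hypothetical erasure marker).

    Each interleave can repair at most one lost sector, so any single burst
    of up to m consecutive unreadable sectors is recoverable. Raises
    ValueError when an interleave has lost more than one sector.
    Assumes the parity sectors are readable (a simplification).
    """
    repaired = list(data)
    for j in range(m):
        members = range(j, len(data), m)
        lost = [i for i in members if data[i] is None]
        if len(lost) > 1:
            raise ValueError(f"interleave {j} lost {len(lost)} sectors")
        if lost:
            acc = parity[j]
            for i in members:
                if data[i] is not None:
                    acc = xor_bytes(acc, data[i])
            repaired[lost[0]] = acc
    return repaired
```

In this sketch a burst of m adjacent bad sectors touches each interleave exactly once, which is why a simple XOR code can handle correlated (bursty) errors, whereas two bad sectors exactly m positions apart fall in the same interleave and cannot be repaired inside the disk; in that case the RAID-level parity across disks would have to take over.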