Failed disk recovery in double erasure RAID arrays

Authors:
Kaushik Srinivasan;Charles J. Colbourn
Affiliations:
Department of Computer Science and Engineering, Arizona State University, PO Box 878809, Tempe, AZ 85287-8809, USA;Department of Computer Science and Engineering, Arizona State University, PO Box 878809, Tempe, AZ 85287-8809, USA
Venue:
Journal of Discrete Algorithms
Year:
2007

Abstract

Reliability is a major concern in the design of large disk arrays. In this paper, we examine what happens when a RAID array encounters more failures than it was designed to tolerate. Erasure codes are incorporated to enable system recovery from a specified number of disk erasures; beyond that threshold, the goal is to recover the system as frequently, and as thoroughly, as possible. Erasure codes for tolerating two disk failures are examined. For these double erasure codes, we establish a correspondence between system operation and acyclicity of the code's graph model. For the most compact double erasure code, the full 2-code, this correspondence underlies an efficient algorithm for computing the system operation probability (the probability that all disks are operating or recoverable). When the system has failed, some disks are nonetheless recoverable. We extend the graph model to determine the probability that d disks have failed, a of which are recoverable by solving one linear equation, b of which are further recoverable by solving systems of linear equations, and d-a-b of which cannot be recovered. These statistics are efficiently calculated for the full 2-code by developing a three-variable ordinary generating function whose coefficients give the specified values. Finally, examples are given to illustrate the probability that an individual disk can be recovered, even when the system is in a failed state.
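
The abstract does not spell out how the graph model is built, so the following is only a minimal sketch under a stated assumption: each parity group is a vertex, and each failed data disk is an edge joining the two parity groups that cover it. Failed check disks are not modeled here. Under that assumption, `is_acyclic` tests whether the failure pattern is a forest, and `peel` is one reading of recovery "by solving one linear equation": it repeatedly recovers a failed disk that is the only failure left in some parity group. Both function names are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def is_acyclic(failed_edges):
    """Return True if the graph of failed disks contains no cycle (union-find)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in failed_edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge closes a cycle
        parent[ru] = rv
    return True


def peel(failed_edges):
    """Recover failed disks one equation at a time (assumed model, see lead-in).

    A disk is recoverable from a single parity equation when it is the only
    failed disk left in one of its parity groups, i.e. its edge has an
    endpoint of degree 1. Returns (recovered_in_order, still_failed_set).
    """
    remaining = set(failed_edges)
    incident = defaultdict(set)
    for e in remaining:
        for v in e:
            incident[v].add(e)

    recovered = []
    leaves = [v for v in incident if len(incident[v]) == 1]
    while leaves:
        v = leaves.pop()
        if len(incident[v]) != 1:
            continue  # its last edge was already recovered from the other end
        (e,) = incident[v]
        recovered.append(e)
        remaining.discard(e)
        for w in e:
            incident[w].discard(e)
            if len(incident[w]) == 1:
                leaves.append(w)
    return recovered, remaining
```

Because every forest has a leaf, peeling recovers every failed disk whenever `is_acyclic` returns True. When the failures contain a cycle, `peel` still recovers the disks on pendant trees and returns the rest as unrecovered; the further recovery by solving systems of linear equations that the abstract mentions is not attempted in this sketch.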