GRID codes: Strip-based erasure codes with high fault tolerance for storage systems

Authors:
Mingqiang Li; Jiwu Shu; Weimin Zheng
Affiliations:
Tsinghua University, Beijing, China; Tsinghua University, Beijing, China; Tsinghua University, Beijing, China
Venue:
ACM Transactions on Storage (TOS)
Year:
2009

Abstract

As storage systems grow in size and complexity, they increasingly face concurrent disk failures combined with multiple unrecoverable sector errors. Ensuring high data reliability and availability therefore requires erasure codes with high fault tolerance. In this article, we present a new family of such codes, named GRID codes. The name reflects their structure: they are strip-based codes whose strips are arranged into multi-dimensional grids. To construct GRID codes, we first introduce the concept of matched codes and then show how matched codes are used to build GRID codes. We also propose an iterative reconstruction algorithm for GRID codes and discuss some of their important features. Finally, we compare GRID codes with several categories of existing codes. Our comparisons show that, for large-scale storage systems, GRID codes have attractive advantages over many existing erasure codes: (a) they are completely XOR-based and have very regular structures, which makes them easy to implement; (b) they can provide a fault tolerance of 15 or even higher; and (c) their storage efficiency can reach 80% or even higher. Together, these advantages make GRID codes more suitable for large-scale storage systems.
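
As a rough illustration of the grid idea and of iterative reconstruction, the sketch below arranges strips in a two-dimensional grid and protects every row and every column with a single XOR parity strip. This is an assumption made purely for illustration: the article builds GRID codes from matched codes, which this abstract does not specify, and its reconstruction algorithm may differ in detail. The names `encode` and `reconstruct` are hypothetical, and each integer stands in for a whole strip.

```python
from functools import reduce
from operator import xor


def encode(data):
    """Toy 2D grid encoding (illustrative assumption, not the article's construction).

    data: r x c list of ints, each int standing in for one data strip.
    Returns an (r+1) x (c+1) grid in which every row and every column XORs to 0.
    """
    # Append one parity strip to each row.
    rows = [row + [reduce(xor, row)] for row in data]
    # Append one parity row covering every column, including the parity column.
    parity_row = [reduce(xor, col) for col in zip(*rows)]
    return rows + [parity_row]


def reconstruct(grid):
    """Repair erased strips (marked None) in place, one row or column at a time.

    A row or column with exactly one erasure is repaired by XOR-ing its
    surviving strips. Passes repeat until no further repair is possible.
    Returns True if the whole grid was recovered, False otherwise.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    row_lines = [[(i, j) for j in range(n_cols)] for i in range(n_rows)]
    col_lines = [[(i, j) for i in range(n_rows)] for j in range(n_cols)]
    progress = True
    while progress:
        progress = False
        for line in row_lines + col_lines:
            missing = [(i, j) for i, j in line if grid[i][j] is None]
            if len(missing) == 1:
                i, j = missing[0]
                survivors = (grid[a][b] for a, b in line if (a, b) != (i, j))
                grid[i][j] = reduce(xor, survivors)
                progress = True
    return all(v is not None for row in grid for v in row)


if __name__ == "__main__":
    original = encode([[1, 2, 3], [4, 5, 6]])
    damaged = [row[:] for row in original]
    # Erase three strips; row 0 alone cannot be repaired, but row 1 and the columns can.
    damaged[0][0] = damaged[0][1] = damaged[1][0] = None
    print(reconstruct(damaged), damaged == original)
```

In this toy grid any three erased strips can be recovered, whereas four erasures at the corners of a rectangle stall the iteration, because every affected row and column then has two missing strips. The price is storage efficiency: the example stores 6 data strips in 12, or 50%.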