A Hybrid Approach to Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation

Authors:
Liping Xiang;Yinlong Xu;John C. S. Lui;Qian Chang;Yubiao Pan;Runhui Li
Affiliations:
University of Science and Technology of China;University of Science and Technology of China;The Chinese University of Hong Kong;University of Science and Technology of China;University of Science and Technology of China;University of Science and Technology of China
Venue:
ACM Transactions on Storage (TOS)
Year:
2011

Abstract

Current parallel storage systems use thousands of inexpensive disks to meet the storage requirements of applications. Data redundancy and/or coding are used to enhance data availability. For instance, row-diagonal parity (RDP) and EVENODD codes, which are widely used in RAID-6 storage systems, keep data available with up to two disk failures. To reduce the probability of data unavailability, disk recovery is carried out whenever a single disk fails. We find that the conventional recovery schemes of RDP and EVENODD codes for a single failed disk use only one parity disk. However, there are two parity disks in the system, and both can be used for single-disk failure recovery. In this article, we propose a hybrid recovery approach that uses both parities for single-disk failure recovery, and we design efficient recovery schemes for the RDP code (RDOR-RDP) and the EVENODD code (RDOR-EVENODD). Our recovery schemes have the following attractive properties: (1) “read optimality,” in the sense that they issue the smallest number of disk reads needed to recover a single failed disk, reducing disk reads by approximately 1/4 compared with conventional schemes; and (2) “load balancing,” in that all surviving disks are subjected to the same (or almost the same) amount of additional workload in rebuilding the failed disk. We carry out a performance evaluation to quantify the merits of RDOR-RDP and RDOR-EVENODD on some widely used disks with DiskSim. The offline experimental results show that RDOR-RDP and RDOR-EVENODD outperform the conventional recovery schemes of RDP and EVENODD codes in terms of total recovery time and recovery workload on each surviving disk. However, the improvements are smaller than the theoretical value (approximately 25%), because RDOR-RDP and RDOR-EVENODD change the disk access pattern from purely sequential to a more random one.
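
The roughly 1/4 saving in disk reads can be checked on a small example. The sketch below is not the RDOR-RDP algorithm itself; it is a brute-force search over every way of assigning each lost symbol to either its row-parity set or its diagonal-parity set, counting the distinct symbols that must be read. It assumes the usual RDP layout, which the abstract does not spell out: a prime p, p - 1 rows, data on disks 0..p-2, row parity on disk p-1, diagonal parity on disk p, symbol (i, j) lying on diagonal (i + j) mod p, and no stored parity for diagonal p - 1. The function name `rdp_recovery_reads` is hypothetical.

```python
from itertools import product


def rdp_recovery_reads(p, failed):
    """Return (conventional_reads, hybrid_min_reads) for one failed data disk.

    Assumed layout (not stated in the abstract): p is prime, rows are
    0..p-2, disks 0..p-2 hold data, disk p-1 holds row parity, and
    disk p holds diagonal parity for diagonals 0..p-2.
    """
    best = None
    # Each lost symbol (i, failed) is rebuilt from its row or its diagonal.
    for choice in product(("row", "diag"), repeat=p - 1):
        reads = set()
        feasible = True
        for i, kind in enumerate(choice):
            if kind == "row":
                # Read every other symbol in row i, including row parity.
                reads.update((i, j) for j in range(p) if j != failed)
            else:
                k = (i + failed) % p
                if k == p - 1:
                    # The "missing" diagonal has no parity symbol.
                    feasible = False
                    break
                for j in range(p):
                    r = (k - j) % p
                    # Skip the lost symbol and the imaginary row p-1.
                    if j != failed and r != p - 1:
                        reads.add((r, j))
                reads.add((k, p))  # the diagonal parity symbol itself
        if feasible and (best is None or len(reads) < best):
            best = len(reads)
    # Conventional recovery uses row parity only: p-1 reads per row.
    conventional = (p - 1) * (p - 1)
    return conventional, best


if __name__ == "__main__":
    for p in (5, 7):
        conv, hybrid = rdp_recovery_reads(p, failed=0)
        print(p, conv, hybrid)
```

Under these assumptions, the search for p = 5 finds that mixing the two parities lowers the read count from 16 to 12, and for p = 7 from 36 to 27, which matches the 1/4 reduction stated in the abstract. The search is exponential in p - 1, so it is only suitable for small illustrative cases.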