Evaluation of Distributed Recovery in Large-Scale Storage Systems

Authors:
Qin Xin;Ethan L. Miller;Thomas J. E. Schwarz
Affiliations:
University of California, Santa Cruz;University of California, Santa Cruz;Santa Clara University
Venue:
HPDC '04 Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing
Year:
2004

Citing 0
Cited 13

LH*RS---a highly-available scalable distributed data structure

ACM Transactions on Database Systems (TODS)
Efficient Updates in Highly Available Distributed Random Access Memory

ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
CRUSH: controlled, scalable, decentralized placement of replicated data

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Ceph: a scalable, high-performance distributed file system

OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
WorkOut: I/O workload outsourcing for boosting RAID reconstruction performance

FAST '09 Proceedings of the 7th conference on File and storage technologies
R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems

Proceedings of the 23rd international conference on Supercomputing
Maintaining and checking parity in highly available Scalable Distributed Data Structures

Journal of Systems and Software
Optimal recovery of single disk failure in RDP code storage systems

Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Victim disk first: an asymmetric cache to boost the performance of disk arrays under faulty conditions

USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
A Hybrid Approach to Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation

ACM Transactions on Storage (TOS)
A reliability optimization method for RAID-structured storage systems based on active data migration

Journal of Systems and Software
IDO: intelligent data outsourcing with improved RAID reconstruction performance in large-scale data centers

LISA '12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Effect of codeword placement on the reliability of erasure coded data storage systems

QEST'13 Proceedings of the 10th international conference on Quantitative Evaluation of Systems

Abstract

Storage clusters consisting of thousands of disk drives are now being used both for their large capacity and high throughput. However, their reliability is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the necessary high data reliability for such systems, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. By simulating system behavior under disk failures, we have examined essential factors that influence system reliability, performance, and cost, such as failure detection, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scale. Our results show the reliability improvement from FARM and demonstrate the impact of these factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.
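
The abstract's central argument, that rebuild time grows with disk capacity and that spreading recovery across many disks shortens it, can be illustrated with a back-of-the-envelope model. The sketch below is not the paper's simulator. The function name `rebuild_hours`, the 1 TB capacity, the 20 MB/s recovery bandwidth, and the 50 parallel targets are all hypothetical values chosen for illustration, and the model assumes recovery bandwidth scales linearly with the number of target disks while ignoring failure detection latency and network limits.

```python
def rebuild_hours(capacity_gb, bandwidth_mb_s, parallel_targets=1):
    """Idealized time, in hours, to re-create the data of one failed disk.

    capacity_gb      -- amount of data to rebuild (hypothetical value)
    bandwidth_mb_s   -- recovery bandwidth available per target disk (assumed)
    parallel_targets -- number of disks receiving rebuilt data at once;
                        1 models a rebuild onto a single spare disk,
                        larger values model distributed recovery
    """
    if parallel_targets < 1:
        raise ValueError("parallel_targets must be at least 1")
    # Assumption: aggregate bandwidth scales linearly with the target count.
    total_mb = capacity_gb * 1000
    seconds = total_mb / (bandwidth_mb_s * parallel_targets)
    return seconds / 3600


# Rebuild onto a single spare versus recovery spread over 50 disks.
single_spare = rebuild_hours(1000, 20)
distributed = rebuild_hours(1000, 20, parallel_targets=50)

print(f"single spare: {single_spare:.2f} h")
print(f"distributed:  {distributed:.2f} h")
print(f"speedup:      {single_spare / distributed:.0f}x")
```

Under these assumptions, doubling `capacity_gb` doubles the single-spare rebuild time, while raising `parallel_targets` shrinks the period during which a second failure could cause data loss. Real systems would see smaller gains because of the factors the abstract lists, such as failure detection and the bandwidth reserved for recovery.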