LH*RS---a highly-available scalable distributed data structure
ACM Transactions on Database Systems (TODS)
Efficient Updates in Highly Available Distributed Random Access Memory
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
CRUSH: controlled, scalable, decentralized placement of replicated data
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Ceph: a scalable, high-performance distributed file system
OSDI '06 Proceedings of the 7th symposium on Operating systems design and implementation
WorkOut: I/O workload outsourcing for boosting RAID reconstruction performance
FAST '09 Proccedings of the 7th conference on File and storage technologies
R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems
Proceedings of the 23rd international conference on Supercomputing
Maintaining and checking parity in highly available Scalable Distributed Data Structures
Journal of Systems and Software
Optimal recovery of single disk failure in RDP code storage systems
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
A Hybrid Approach to Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation
ACM Transactions on Storage (TOS)
A reliability optimization method for RAID-structured storage systems based on active data migration
Journal of Systems and Software
lisa'12 Proceedings of the 26th international conference on Large Installation System Administration: strategies, tools, and techniques
Effect of codeword placement on the reliability of erasure coded data storage systems
QEST'13 Proceedings of the 10th international conference on Quantitative Evaluation of Systems
Hi-index | 0.00 |
Storage clusters consisting of thousands of disk drives are now being used both for their large capacity and high throughput. However, their reliability is far worse than that of smaller storage systems due to the increased number of storage nodes. RAID technology is no longer sufficient to guarantee the necessary high data reliability for such systems, because disk rebuild time lengthens as disk capacity grows. In this paper, we present FAst Recovery Mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time. FARM works in concert with replication and erasure-coding redundancy schemes to dramatically lower the probability of data loss in large-scale storage systems. We have examined essential factors that influence system reliability, performance, and costs, such as failure detections, disk bandwidth usage for recovery, disk space utilization, disk drive replacement, and system scales, by simulating system behavior under disk failures. Our results show the reliability improvement from FARM and demonstrate the impacts of various factors on system reliability. Using our techniques, system designers will be better able to build multi-petabyte storage systems with much higher reliability at lower cost than previously possible.