In search of I/O-optimal recovery from disk failures

Authors:
Osama Khan; Randal Burns; James Plank; Cheng Huang
Affiliations:
Department of Computer Science, Johns Hopkins University; Department of Computer Science, Johns Hopkins University; Department of Electrical Engineering and Computer Science, University of Tennessee; Microsoft Research
Venue:
HotStorage'11 Proceedings of the 3rd USENIX conference on Hot topics in storage and file systems
Year:
2011

Abstract

We address the problem of minimizing the I/O needed to recover from disk failures in erasure-coded storage systems. The principal result is an algorithm that finds the I/O-optimal recovery from an arbitrary number of disk failures for any XOR-based erasure code. We also describe a family of codes with high fault tolerance and low recovery I/O; for example, one instance tolerates up to 11 failures and recovers a lost block in 4 I/Os. Although we have determined I/O-optimal recovery for any given code, identifying codes with the best recovery properties remains an open problem. We describe our ongoing efforts toward characterizing the tradeoff between space overhead and recovery I/O and toward generating codes that realize these bounds.
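
To make the recovery problem concrete, the following is a minimal sketch of what "minimum recovery I/O" means for an XOR-based code. It is an exhaustive search over surviving symbols, not the algorithm from the paper, and it is practical only for toy codes. The names `in_span`, `min_read_set`, and the symbol labels (`d0`, `p_all`, `p_lo`) are hypothetical, chosen for illustration. Each symbol is assumed to be described by a bitmask saying which data symbols are XORed into it, and the number of symbols read stands in for I/O cost.

```python
from itertools import combinations


def in_span(target, vectors):
    """Return True if `target` is an XOR (GF(2)) combination of `vectors`."""
    basis = {}  # leading-bit position -> reduced vector with that leading bit
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]  # eliminate the leading bit and keep reducing
    while target:
        top = target.bit_length() - 1
        if top not in basis:
            return False
        target ^= basis[top]
    return True


def min_read_set(symbols, lost):
    """Brute-force the smallest set of surviving symbols that rebuilds `lost`.

    symbols: dict mapping a symbol name to its bitmask over data symbols.
    lost:    set of names of failed symbols.
    Returns a set of names to read, or None if the failure is unrecoverable.
    """
    survivors = sorted(name for name in symbols if name not in lost)
    for size in range(len(survivors) + 1):  # try the smallest read sets first
        for subset in combinations(survivors, size):
            vecs = [symbols[name] for name in subset]
            if all(in_span(symbols[name], vecs) for name in lost):
                return set(subset)
    return None


# Toy code (an assumption for illustration): four data symbols, one parity
# over all of them, and one local parity over d0 and d1 only.
toy_code = {
    "d0": 0b0001, "d1": 0b0010, "d2": 0b0100, "d3": 0b1000,
    "p_all": 0b1111,  # d0 ^ d1 ^ d2 ^ d3
    "p_lo": 0b0011,   # d0 ^ d1
}
reads = min_read_set(toy_code, {"d0"})
print(sorted(reads))
```

In this toy code, losing `d0` can be repaired by reading only `d1` and `p_lo`, whereas the same code without `p_lo` would need all four remaining symbols. That gap is the kind of space-overhead versus recovery-I/O tradeoff the abstract refers to.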