SD codes: erasure codes designed for how storage systems really fail

Authors:
James S. Plank; Mario Blaum; James L. Hafner
Affiliations:
EECS Department, University of Tennessee; IBM Research Division, Almaden Research Center; IBM Research Division, Almaden Research Center
Venue:
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies
Year:
2013

Abstract

Traditionally, when storage systems employ erasure codes, the codes are designed to tolerate the failures of entire disks. However, the most common types of failures are latent sector failures, which affect only individual disk sectors, and block failures, which arise through wear on SSDs. This paper introduces SD codes, which are designed to tolerate combinations of disk and sector failures. As such, they consume far fewer storage resources than traditional erasure codes. We specify the codes in enough detail for the storage practitioner to employ them, discuss their practical properties, and detail an open-source implementation.
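The storage saving claimed in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper's implementation. It assumes a stripe laid out as r rows of sectors across n disks, an SD code that dedicates m whole disks plus s extra sectors per stripe to coding, and a disk-only code that must dedicate one further whole disk to cover a stray sector failure. The function names and the example parameters (n = 8, r = 16) are hypothetical, chosen only for illustration.

```python
def coding_sectors_disk_only(r: int, coding_disks: int) -> int:
    """Sectors per stripe spent on coding when whole disks hold the redundancy."""
    return coding_disks * r


def coding_sectors_sd(r: int, m: int, s: int) -> int:
    """Sectors per stripe spent on coding with m coding disks plus s coding sectors
    (assumed SD-style layout)."""
    return m * r + s


def data_fraction(n: int, r: int, coding_sectors: int) -> float:
    """Fraction of the n * r sectors in a stripe left over for user data."""
    total = n * r
    return (total - coding_sectors) / total


# Hypothetical stripe: 8 disks, 16 sectors per disk per stripe.
n, r = 8, 16

# Goal: survive one whole-disk failure plus one additional sector failure.
# Disk-only approach: two coding disks (a RAID-6 style layout).
disk_only = coding_sectors_disk_only(r, coding_disks=2)
# SD-style approach: one coding disk plus one coding sector.
sd = coding_sectors_sd(r, m=1, s=1)

print(disk_only, data_fraction(n, r, disk_only))
print(sd, data_fraction(n, r, sd))
```

Under these assumptions the disk-only layout spends 32 of 128 sectors on coding, while the SD-style layout spends 17, which is where the reduction in storage overhead comes from. The sketch says nothing about how the coding sectors are computed; it only counts them.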