Parity lost and parity regained

Authors:
Andrew Krioukov;Lakshmi N. Bairavasundaram;Garth R. Goodson;Kiran Srinivasan;Randy Thelen;Andrea C. Arpaci-Dusseau;Remzi H. Arpaci-Dusseau
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison;Network Appliance, Inc.;Network Appliance, Inc.;Network Appliance, Inc.;University of Wisconsin-Madison;University of Wisconsin-Madison
Venue:
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies
Year:
2008

Abstract

RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by certain techniques or a combination of techniques is sometimes unclear. We introduce and apply a formal method of analyzing the design of data protection strategies. Specifically, we use model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives. We evaluate the approaches taken by a number of real systems under single-error conditions, and find flaws in every scheme. In particular, we identify a parity pollution problem that spreads corrupt data (the result of a single error) across multiple disks, thus leading to data loss or corruption. We further identify which protection measures must be used to avoid such problems. Finally, we show how to combine real-world failure data with the results from the model checker to estimate the actual likelihood of data loss of different protection strategies.
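The parity pollution problem named in the abstract can be illustrated with a small sketch. This is not the authors' model checker. It is a toy example that assumes a single-parity XOR stripe and an additive parity write (parity recomputed from all data blocks currently on disk). The names `xor_blocks` and `parity_pollution_demo` are hypothetical and exist only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (single-parity, RAID-4/5 style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_pollution_demo():
    """Return (recovered_before, recovered_after) for the corrupted block."""
    # Assumed stripe: three data blocks plus one parity block.
    original = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(original)

    # Single error: block 1 is silently corrupted on disk.
    on_disk = list(original)
    on_disk[1] = b"BxBB"

    # Before any further write, parity still encodes the good data,
    # so block 1 can be reconstructed from the other blocks and parity.
    recovered_before = xor_blocks([on_disk[0], on_disk[2], parity])

    # Additive parity write to block 0: the array reads the other data
    # blocks (including the corrupt one) and recomputes parity from them.
    on_disk[0] = b"DDDD"
    parity = xor_blocks(on_disk)

    # Parity now agrees with the corrupt block; reconstruction returns
    # the corrupt contents and the original data is no longer recoverable.
    recovered_after = xor_blocks([on_disk[0], on_disk[2], parity])
    return recovered_before, recovered_after


if __name__ == "__main__":
    before, after = parity_pollution_demo()
    print("recovered before parity update:", before)
    print("recovered after parity update: ", after)
```

In this sketch the single corruption starts on one disk, and after the parity update it is encoded on two (the data disk and the parity disk). That spread is what the abstract describes. Real systems add checksums, write verification, or similar techniques, which this toy example deliberately leaves out.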