EXPLODE: a lightweight, general system for finding serious storage system errors

Authors:
Junfeng Yang;Can Sar;Dawson Engler
Affiliations:
Stanford University;Stanford University;Stanford University
Venue:
OSDI '06 Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation - Volume 7
Year:
2006

Cited by 28:

Parity lost and parity regained
FAST'08 Proceedings of the 6th USENIX Conference on File and Storage Technologies

Zyzzyva: speculative Byzantine fault tolerance
Communications of the ACM - Remembering Jim Gray

BFT: the time is now
LADIS '08 Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware

Predicting and preventing inconsistencies in deployed distributed systems
ACM Transactions on Computer Systems (TOCS)

Membrane: Operating system support for restartable file systems
ACM Transactions on Storage (TOS)

End-to-end data integrity for file systems: a ZFS case study
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies

Membrane: operating system support for restartable file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies

SQCK: a declarative file system checker
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation

Tolerating file-system mistakes with EnvyFS
USENIX'09 Proceedings of the 2009 conference on USENIX Annual technical conference

Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability

Depot: cloud storage with minimal trust
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation

S2E: a platform for in-vivo multi-path analysis of software systems
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems

Making the common case the only case with anticipatory memory allocation
FAST'11 Proceedings of the 9th USENIX conference on File and storage technologies

FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation

Depot: Cloud Storage with Minimal Trust
ACM Transactions on Computer Systems (TOCS)

Towards reliable storage systems
Towards reliable storage systems

Making the common case the only case with anticipatory memory allocation
ACM Transactions on Storage (TOS)

The S2E Platform: Design, Implementation, and Applications
ACM Transactions on Computer Systems (TOCS) - Special Issue ASPLOS 2011

Scalable testing of file system checkers
Proceedings of the 7th ACM european conference on Computer Systems

Recon: verifying file system consistency at runtime
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies

Consistency without ordering
FAST'12 Proceedings of the 10th USENIX conference on File and Storage Technologies

Towards efficient, portable application-level consistency
Proceedings of the 9th Workshop on Hot Topics in Dependable Systems

Ffsck: The Fast File-System Checker
ACM Transactions on Storage (TOS)

A Study of Linux File System Evolution
ACM Transactions on Storage (TOS)

Ffsck: the fast file system checker
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

A study of Linux file system evolution
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

HARDFS: hardening HDFS with selective and lightweight versioning
FAST'13 Proceedings of the 11th USENIX conference on File and Storage Technologies

Abstract

Storage systems such as file systems, databases, and RAID systems have a simple, basic contract: you give them data, they do not lose or corrupt it. Often they store the only copy, making its irrevocable loss almost arbitrarily bad. Unfortunately, their code is exceptionally hard to get right, since it must correctly recover from any crash at any program point, no matter how their state was smeared across volatile and persistent memory. This paper describes EXPLODE, a system that makes it easy to systematically check real storage systems for errors. It takes user-written, potentially system-specific checkers and uses them to drive a storage system into tricky corner cases, including crash recovery errors. EXPLODE uses a novel adaptation of ideas from model checking, a comprehensive, heavy-weight formal verification technique, that makes its checking more systematic (and hopefully more effective) than a pure testing approach while being just as lightweight. EXPLODE is effective. It found serious bugs in a broad range of real storage systems (without requiring source code): three version control systems, Berkeley DB, an NFS implementation, ten file systems, a RAID system, and the popular VMware GSX virtual machine. We found bugs in every system we checked, 36 bugs in total, typically with little effort.
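
To make the idea of systematically checking crash recovery concrete, the following is a minimal toy sketch in Python. It is not EXPLODE's interface or implementation; the names `crash_states`, `find_violations`, and `check_old_or_new`, the block/pointer disk model, and the assumption that any subset of unsynced writes may reach disk at a crash are all hypothetical and chosen only for illustration. The sketch enumerates every crash point in a short update sequence, builds each disk image that could survive the crash, and runs a user-written checker on it.

```python
from itertools import combinations

def crash_states(initial, ops):
    """Yield (crash_point, disk_image) for every modeled crash.

    ops is a list of ("write", key, value) or ("sync",) tuples. Writes stay
    in a volatile cache until a sync. Toy assumption: at a crash, any subset
    of the cached writes may have reached the disk.
    """
    disk, dirty = dict(initial), []
    for i in range(len(ops) + 1):
        # Crash just before ops[i] runs (or after the last op).
        for r in range(len(dirty) + 1):
            for subset in combinations(dirty, r):
                image = dict(disk)
                for key, value in subset:  # subset keeps program order
                    image[key] = value
                yield i, image
        if i == len(ops):
            break
        op = ops[i]
        if op[0] == "write":
            dirty.append((op[1], op[2]))
        else:  # "sync": everything cached becomes persistent
            for key, value in dirty:
                disk[key] = value
            dirty = []

def check_old_or_new(image):
    """Hypothetical checker: the file must hold the old or the new data."""
    return image.get(image["ptr"]) in ("old", "new")

def find_violations(initial, ops, check):
    """Return every reachable crash image that the checker rejects."""
    return [(i, img) for i, img in crash_states(initial, ops) if not check(img)]

# A file is a pointer ("ptr") to a data block. The update writes a new block
# and then switches the pointer to it.
INITIAL = {"blk0": "old", "ptr": "blk0"}
BUGGY = [("write", "blk1", "new"), ("write", "ptr", "blk1")]
FIXED = [("write", "blk1", "new"), ("sync",), ("write", "ptr", "blk1")]
```

Under these assumptions, `find_violations(INITIAL, BUGGY, check_old_or_new)` reports a crash image in which the pointer reached disk but the block it names did not, while the `FIXED` sequence, which syncs the data before switching the pointer, yields no violations. Exhaustive enumeration of this kind is practical only for short operation sequences, since the number of subsets grows exponentially with the number of unsynced writes.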