dBug: systematic evaluation of distributed systems

Authors:
Jiri Simsa; Randy Bryant; Garth Gibson
Affiliations:
Computer Science Department, Carnegie Mellon University; Computer Science Department, Carnegie Mellon University; Computer Science Department, Carnegie Mellon University
Venue:
SSV'10 Proceedings of the 5th international conference on Systems software verification
Year:
2010

Abstract

This paper presents the design, implementation, and evaluation of dBug, a tool that leverages manual instrumentation for the systematic evaluation of distributed and concurrent systems. Given a distributed concurrent system, its initial state, and a workload, dBug systematically explores the possible orders in which the concurrent events triggered by the workload can occur. dBug can optionally apply partial order reduction to avoid exploring equivalent orders. Given a correctness check, dBug can verify that every possible serialization of a concurrent workload executes correctly. When it encounters an error, the tool produces a trace that can be replayed to investigate the error. We applied dBug to two distributed systems: the Parallel Virtual File System (PVFS), implemented in C, and the FAWN-based key-value store (FAWN-KV), implemented in C++. We integrated both systems with dBug to expose their non-determinism due to concurrency, and used this integration to verify that the concurrent execution of a number of basic operations from a fixed initial state meets the high-level specifications of PVFS and FAWN-KV. The experimental evidence shows that dBug can systematically explore the behaviors of a distributed system in a modular, practical, and effective manner.
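The abstract describes the core loop only in prose. It enumerates the orders of concurrent events from a fixed initial state, checks each complete execution against a specification, optionally prunes equivalent orders, and emits a replayable trace on failure. The Python sketch below illustrates that idea on a toy in-memory key-value store. It is not dBug's implementation, since dBug intercepts events in real C/C++ systems. The names `Step`, `replay`, `explore`, `increment`, and `put` and the read/write independence rule are hypothetical. Partial order reduction is approximated here with sleep sets in a stateless search that re-executes each schedule prefix from the initial state, and this may differ from the reduction dBug itself uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Store = Dict[str, int]   # shared key-value state
Locals = Dict[str, int]  # per-client local variables


@dataclass(frozen=True)
class Step:
    """One atomic event of a client (hypothetical model, not dBug's API)."""
    kind: str                                # "read" or "write"
    key: str                                 # key touched in the shared store
    action: Callable[[Store, Locals], None]  # deterministic effect of the event


Program = List[Step]  # one client's sequence of events


def independent(a: Step, b: Step) -> bool:
    """Events of different clients commute unless they touch the same key
    and at least one of them writes it (assumed independence rule)."""
    return a.key != b.key or (a.kind == "read" and b.kind == "read")


def replay(programs: List[Program], init: Store, schedule: List[int]) -> Store:
    """Deterministically re-execute a schedule (sequence of client ids)."""
    store = dict(init)
    local_vars: List[Locals] = [{} for _ in programs]
    pcs = [0] * len(programs)
    for tid in schedule:
        programs[tid][pcs[tid]].action(store, local_vars[tid])
        pcs[tid] += 1
    return store


def explore(programs: List[Program], init: Store,
            check: Callable[[Store], bool],
            use_por: bool = True) -> Tuple[Optional[List[int]], int]:
    """Depth-first search over interleavings. Returns the first failing
    schedule (or None) and the number of complete executions checked.
    With use_por, sleep sets skip orders equivalent to explored ones."""
    checked = 0

    def dfs(schedule: List[int], sleep: FrozenSet[int]) -> Optional[List[int]]:
        nonlocal checked
        pcs = [schedule.count(t) for t in range(len(programs))]
        enabled = [t for t in range(len(programs)) if pcs[t] < len(programs[t])]
        if not enabled:  # complete execution: apply the correctness check
            checked += 1
            return None if check(replay(programs, init, schedule)) else schedule
        done: Set[int] = set()
        for t in enabled:
            if t in sleep:  # an equivalent order was already explored
                continue
            step = programs[t][pcs[t]]
            # Keep asleep only clients whose next event commutes with this one.
            child_sleep = frozenset(
                u for u in sleep | done
                if use_por and independent(programs[u][pcs[u]], step))
            bad = dfs(schedule + [t], child_sleep)
            if bad is not None:
                return bad
            done.add(t)
        return None

    return dfs([], frozenset()), checked


def increment(key: str) -> Program:
    """Non-atomic read-modify-write: two of these can lose an update."""
    def read(store: Store, loc: Locals) -> None:
        loc["v"] = store[key]

    def write(store: Store, loc: Locals) -> None:
        store[key] = loc["v"] + 1

    return [Step("read", key, read), Step("write", key, write)]


def put(key: str, value: int) -> Program:
    """A single blind write to a key."""
    def write(store: Store, loc: Locals) -> None:
        store[key] = value

    return [Step("write", key, write)]


if __name__ == "__main__":
    progs = [increment("x"), increment("x"), put("y", 7)]
    init = {"x": 0, "y": 0}
    bad, _ = explore(progs, init, lambda s: s["x"] == 2 and s["y"] == 7)
    print("failing trace:", bad, "->", replay(progs, init, bad))
    print("runs without reduction:", explore(progs, init, lambda s: True, use_por=False)[1])
    print("runs with sleep sets:", explore(progs, init, lambda s: True)[1])
```

With the lost-update race between the two `increment` clients, the search stops at schedule `[0, 1, 0, 1, 2]`, and replaying that schedule reproduces `x == 1`. The list of client ids is itself the replayable trace, because every step is deterministic. With a check that always passes, the unreduced search visits all 30 interleavings of the five events. The sleep-set search visits 4, one per class of equivalent orders, because the write to `y` and the two reads of `x` commute with each other.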