Eraser: a dynamic data race detector for multithreaded programs
ACM Transactions on Computer Systems (TOCS)
ACM Transactions on Computer Systems (TOCS)
Revisiting the PAXOS algorithm
Theoretical Computer Science
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Distributed Algorithms
Chord: a scalable peer-to-peer lookup protocol for internet applications
IEEE/ACM Transactions on Networking (TON)
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
ReVirt: enabling intrusion analysis through virtual-machine logging and replay
ACM SIGOPS Operating Systems Review - OSDI '02: Proceedings of the 5th symposium on Operating systems design and implementation
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Implementing declarative overlays
Proceedings of the twentieth ACM symposium on Operating systems principles
RaceTrack: efficient detection of data race conditions via adaptive tracking
Proceedings of the twentieth ACM symposium on Operating systems principles
Simulating Large-Scale P2P Systems with the WiDS Toolkit
MASCOTS '05 Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems
Internet Scale Testing of PNRP Using WiDS Network Simulator
P2P '06 Proceedings of the Sixth IEEE International Conference on Peer-to-Peer Computing
Using model checking to find serious file system errors
ACM Transactions on Computer Systems (TOCS)
Cork: dynamic memory leak detection for garbage-collected languages
Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Using queries for distributed monitoring and forensics
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Flashback: a lightweight extension for rollback and deterministic replay for software debugging
ATEC '04 Proceedings of the annual conference on USENIX Annual Technical Conference
WiDS: an integrated toolkit for distributed system development
HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Model checking large network protocol implementations
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
MACEDON: methodology for automatically creating, evaluating, and designing overlay networks
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Boxwood: abstractions as the foundation for storage infrastructure
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Replay debugging for distributed applications
ATEC '06 Proceedings of the annual conference on USENIX '06 Annual Technical Conference
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Life, death, and the critical transition: finding liveness bugs in systems code
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
BitVault: a highly reliable distributed data retention platform
ACM SIGOPS Operating Systems Review - Systems work at Microsoft Research
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
CrystalBall: predicting and preventing inconsistencies in deployed distributed systems
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
DSF: a common platform for distributed systems research and development
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
Predicting and preventing inconsistencies in deployed distributed systems
ACM Transactions on Computer Systems (TOCS)
A query language for understanding component interactions in production systems
Proceedings of the 24th ACM International Conference on Supercomputing
DSF: a common platform for distributed systems research and development
Middleware'09 Proceedings of the ACM/IFIP/USENIX 10th international conference on Middleware
Measurement and diagnosis of address misconfigured P2P traffic
INFOCOM'10 Proceedings of the 29th conference on Information communications
R2: an application-level kernel for record and replay
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Finding and reproducing Heisenbugs in concurrent programs
OSDI'08 Proceedings of the 8th USENIX conference on Operating systems design and implementation
Finding latent performance bugs in systems implementations
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
Towards automatically checking thousands of failures with micro-specifications
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Model checking a networked system without the network
Proceedings of the 8th USENIX conference on Networked systems design and implementation
FATE and DESTINI: a framework for cloud recovery testing
Proceedings of the 8th USENIX conference on Networked systems design and implementation
Life, death, and the critical transition: finding liveness bugs in systems code
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
Friday: global comprehension for distributed replay
NSDI'07 Proceedings of the 4th USENIX conference on Networked systems design & implementation
InContext: simple parallelism for distributed applications
Proceedings of the 20th international symposium on High performance distributed computing
A characteristic study on failures of production distributed data-parallel programs
Proceedings of the 2013 International Conference on Software Engineering
On fault resilience of OpenStack
Proceedings of the 4th annual Symposium on Cloud Computing
DEFINED: deterministic execution for interactive control-plane debugging
USENIX ATC'13 Proceedings of the 2013 USENIX conference on Annual Technical Conference
Hi-index | 0.00 |
Despite many efforts, the predominant practice of debugging a distributed system is still printf-based log mining, which is both tedious and error-prone. In this paper, we present WiDS Checker, a unified framework that can check distributed systems through both simulation and reproduced runs from real deployment. All instances of a distributed system can be executed within one simulation process, multiplexed properly to observe the "happensbefore" relationship, thus accurately reveal full system state. A versatile script language allows a developer to refine system properties into straightforward assertions, which the checker inspects for violations. Combining these two components, we are able to check distributed properties that are otherwise impossible to check. We applied WiDS Checker over a suite of complex and real systems and found non-trivial bugs, including one in a previously proven Paxos specification. Our experience demonstrates the usefulness of the checker and allows us to gain insights beneficial to future research in this area.