Live Debugging of Distributed Systems

Authors:
Darren Dao; Jeannie Albrecht; Charles Killian; Amin Vahdat
Affiliations:
University of California, San Diego, La Jolla; Williams College, Williamstown; Purdue University, West Lafayette; University of California, San Diego, La Jolla
Venue:
CC '09 Proceedings of the 18th International Conference on Compiler Construction: Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009
Year:
2009

Abstract

Debugging distributed systems is challenging. Although incremental debugging during development finds some bugs, developers are rarely able to fully test their systems under realistic operating conditions prior to deployment. While deploying a system exposes it to realistic conditions, debugging requires the developer to: (i) detect a bug, (ii) gather the system state necessary for diagnosis, and (iii) sift through the gathered state to determine a root cause. In this paper, we present MaceODB, a tool to assist programmers with debugging deployed distributed systems. Programmers define a set of runtime properties for their system, which MaceODB checks for violations during execution. Once MaceODB detects a violation, it provides the programmer with the information to determine its root cause. We have been able to diagnose several non-trivial bugs in existing mature distributed systems using MaceODB; we discuss two of these bugs in this paper. Benchmarks indicate that the approach has low overhead and is suitable for in situ debugging of deployed systems.
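
The idea of checking programmer-defined runtime properties against live system state can be illustrated with a small sketch. The code below is not MaceODB's interface, which this abstract does not describe; every name in it (NodeState, ring_consistency, check_properties) is hypothetical, and the ring-overlay property is an assumed example chosen for illustration. It evaluates a property over a snapshot of per-node state and, on a violation, returns the offending nodes' state so a developer has something to diagnose from.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class NodeState:
    """Hypothetical snapshot of one node's routing state in a ring overlay."""
    node_id: int
    successor: int
    predecessor: int


Snapshot = Dict[int, NodeState]
# A property maps a snapshot to the ids of the nodes that violate it.
Property = Callable[[Snapshot], List[int]]


def ring_consistency(snapshot: Snapshot) -> List[int]:
    """Assumed example property: my successor's predecessor should be me."""
    violators = []
    for node in snapshot.values():
        succ = snapshot.get(node.successor)
        if succ is None or succ.predecessor != node.node_id:
            violators.append(node.node_id)
    return sorted(violators)


def check_properties(
    snapshot: Snapshot, properties: Dict[str, Property]
) -> Dict[str, List[NodeState]]:
    """Evaluate each property; report violating nodes with their state."""
    report: Dict[str, List[NodeState]] = {}
    for name, prop in properties.items():
        bad = prop(snapshot)
        if bad:
            report[name] = [snapshot[i] for i in bad]
    return report


PROPERTIES = {"ring_consistency": ring_consistency}

# A consistent three-node ring: 1 -> 2 -> 3 -> 1.
healthy = {
    1: NodeState(1, successor=2, predecessor=3),
    2: NodeState(2, successor=3, predecessor=1),
    3: NodeState(3, successor=1, predecessor=2),
}

# Node 3 holds a stale predecessor pointer (1 instead of 2).
broken = dict(healthy)
broken[3] = NodeState(3, successor=1, predecessor=1)

if __name__ == "__main__":
    print(check_properties(healthy, PROPERTIES))
    print(check_properties(broken, PROPERTIES))
```

In this sketch the violation is reported against node 2, whose successor no longer points back to it, while the stale pointer itself lives on node 3. That gap between where a violation is detected and where its cause resides is one reason the gathered state has to accompany the violation report.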