Diagnosing distributed systems with self-propelled instrumentation

Authors:
Alexander V. Mirgorodskiy; Barton P. Miller
Affiliations:
VMware, Inc.; University of Wisconsin
Venue:
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
Year:
2008

Abstract

We present a three-part approach for diagnosing bugs and performance problems in production distributed environments. First, we introduce a novel execution monitoring technique that dynamically injects a fragment of code, the agent, into an application process on demand. The agent inserts instrumentation ahead of the control flow within the process and propagates into other processes, following communication events, crossing host boundaries, and collecting a distributed function-level trace of the execution. Second, we present an algorithm that separates the trace into user-meaningful activities called flows. This step simplifies manual examination and enables automated analysis of the trace. Finally, we describe our automated root cause analysis technique that compares the flows to help the analyst locate an anomalous flow and identify a function in that flow that is a likely cause of the anomaly. We demonstrate the effectiveness of our techniques by diagnosing two complex problems in the Condor distributed scheduling system.
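The flow-comparison step can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify; it only shows the general idea of comparing flows to find an outlier and a suspect function. It assumes each flow has already been separated from the trace and is given as a list of (function, seconds) events. The names `profile`, `distance`, `most_anomalous`, and `suspect_function`, the L1 distance, and the sample data are all hypothetical choices made for illustration.

```python
from collections import Counter


def profile(events):
    """Summarize one flow as total seconds spent per function."""
    totals = Counter()
    for func, seconds in events:
        totals[func] += seconds
    return dict(totals)


def distance(p, q):
    """L1 distance between two per-function time profiles (an assumed metric)."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))


def most_anomalous(flows):
    """Return the name of the flow whose profile is farthest, on average,
    from all other flows. Requires at least two flows."""
    profiles = {name: profile(events) for name, events in flows.items()}

    def score(name):
        others = [distance(profiles[name], profiles[o])
                  for o in profiles if o != name]
        return sum(others) / len(others)

    return max(profiles, key=score)


def suspect_function(flows, anomalous):
    """Return the function whose time in the anomalous flow deviates most
    from its mean time across the remaining flows."""
    profiles = {name: profile(events) for name, events in flows.items()}
    normal = [p for name, p in profiles.items() if name != anomalous]
    target = profiles[anomalous]
    funcs = set(target).union(*normal)

    def deviation(func):
        mean = sum(p.get(func, 0.0) for p in normal) / len(normal)
        return abs(target.get(func, 0.0) - mean)

    return max(funcs, key=deviation)


# Hypothetical flows: job3 spends extra time in a function the others never call.
flows = {
    "job1": [("submit", 1.0), ("schedule", 1.0), ("run", 8.0)],
    "job2": [("submit", 1.0), ("schedule", 1.0), ("run", 8.0)],
    "job3": [("submit", 1.0), ("schedule", 1.0),
             ("retry_connect", 6.0), ("run", 8.0)],
}
outlier = most_anomalous(flows)
print(outlier, suspect_function(flows, outlier))
```

Scoring each flow by its average distance to the others keeps the sketch short; a real analysis would likely need normalization across flows of different lengths and a way to handle several anomalous flows at once.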