Diagnosing Distributed Systems with Self-propelled Instrumentation

Authors:
Alexander V. Mirgorodskiy; Barton P. Miller
Affiliations:
VMware, Inc.; Computer Sciences Dept., University of Wisconsin
Venue:
Middleware '08: Proceedings of the ACM/IFIP/USENIX 9th International Middleware Conference
Year:
2008

Cited by:
A domain specific aspect language for run-time inspection (Proceedings of the seventh workshop on Domain-Specific Aspect Languages)
Automated tracing and visualization of software security structure and properties (Proceedings of the Ninth International Symposium on Visualization for Cyber Security)
A generic solution for agile run-time inspection middleware (Proceedings of the 12th International Middleware Conference)
Performance problem diagnostics by systematic experimentation (Proceedings of the 18th international doctoral symposium on Components and architecture)
Supporting swift reaction: automatically uncovering performance problems by systematic experiments (Proceedings of the 2013 International Conference on Software Engineering)

Abstract

We present a three-part approach for diagnosing bugs and performance problems in production distributed environments. First, we introduce a novel execution monitoring technique that dynamically injects a fragment of code, the agent, into an application process on demand. The agent inserts instrumentation ahead of the control flow within the process and propagates into other processes, following communication events, crossing host boundaries, and collecting a distributed function-level trace of the execution. Second, we present an algorithm that separates the trace into user-meaningful activities called flows. This step simplifies manual examination and enables automated analysis of the trace. Finally, we describe our automated root cause analysis technique that compares the flows to help the analyst locate an anomalous flow and identify a function in that flow that is a likely cause of the anomaly. We demonstrate the effectiveness of our techniques by diagnosing two complex problems in the Condor distributed scheduling system.
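
The third step, comparing flows to find an anomalous one and a suspect function within it, can be illustrated with a minimal sketch. It is not taken from the paper: it assumes each flow has already been reduced to a hypothetical profile mapping function names to time spent, scores each flow by its L1 distance to its nearest neighbour, and blames the function whose share of time deviates most from the other flows. The flow identifiers, function names, and timings are invented for illustration.

```python
from statistics import mean

def normalize(profile):
    """Convert per-function times into fractions of the flow's total time."""
    total = sum(profile.values())
    return {fn: t / total for fn, t in profile.items()} if total else {}

def l1_distance(p, q):
    """Manhattan distance between two sparse profiles."""
    return sum(abs(p.get(fn, 0.0) - q.get(fn, 0.0)) for fn in set(p) | set(q))

def most_anomalous_flow(flows):
    """Return (flow_id, score) for the flow farthest from its nearest neighbour.

    Assumes at least two flows are supplied.
    """
    profiles = {fid: normalize(p) for fid, p in flows.items()}
    scores = {
        fid: min(l1_distance(p, q) for other, q in profiles.items() if other != fid)
        for fid, p in profiles.items()
    }
    worst = max(scores, key=scores.get)
    return worst, scores[worst]

def suspect_function(flows, anomalous_id):
    """Return the function whose time share deviates most from the other flows."""
    profiles = {fid: normalize(p) for fid, p in flows.items()}
    target = profiles[anomalous_id]
    others = [p for fid, p in profiles.items() if fid != anomalous_id]
    functions = set(target).union(*others)

    def deviation(fn):
        return abs(target.get(fn, 0.0) - mean(p.get(fn, 0.0) for p in others))

    return max(functions, key=deviation)

# Hypothetical flows: seconds spent per function in each job-handling activity.
flows = {
    "job-1": {"submit": 2.0, "match": 3.0, "run": 5.0},
    "job-2": {"submit": 2.1, "match": 2.9, "run": 5.0},
    "job-3": {"submit": 1.9, "match": 3.1, "run": 5.0},
    "job-4": {"submit": 2.0, "match": 3.0, "run": 1.0, "retry": 14.0},
}

anomalous_id, score = most_anomalous_flow(flows)
print(anomalous_id, suspect_function(flows, anomalous_id))
```

Normalizing to time fractions makes flows of different total duration comparable; a nearest-neighbour score is one simple choice among many and would miss anomalies that recur across several flows.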