Techniques for debugging parallel programs with flowback analysis
ACM Transactions on Programming Languages and Systems (TOPLAS)
An open graph visualization system and its applications to software engineering
Software—Practice & Experience - Special issue on discrete algorithm engineering
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
Finding failures by cluster analysis of execution profiles
ICSE '01 Proceedings of the 23rd International Conference on Software Engineering
Bugs as deviant behavior: a general approach to inferring errors in systems code
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
Isolating failure-inducing thread schedules
ISSTA '02 Proceedings of the 2002 ACM SIGSOFT international symposium on Software testing and analysis
Visualization of test information to assist fault localization
Proceedings of the 24th International Conference on Software Engineering
Isolating cause-effect chains from computer programs
Proceedings of the 10th ACM SIGSOFT symposium on Foundations of software engineering
DPM: A Measurement System for Distributed Programs
IEEE Transactions on Computers
ICSM '93 Proceedings of the Conference on Software Maintenance
UNIX Network Programming, Vol. 1
UNIX Network Programming, Vol. 1
Performance debugging for distributed systems of black boxes
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
Stateful distributed interposition
ACM Transactions on Computer Systems (TOCS)
Distributed computing in practice: the Condor experience: Research Articles
Concurrency and Computation: Practice & Experience - Grid Performance
Scalable statistical bug isolation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
TraceBack: first fault diagnosis by reconstruction of distributed control flow
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Capturing, indexing, clustering, and retrieving system history
Proceedings of the twentieth ACM symposium on Operating systems principles
Stardust: tracking activity in a distributed storage system
SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems
SysProf: Online Distributed Behavior Diagnosis through Fine-grain System Monitoring
ICDCS '06 Proceedings of the 26th IEEE International Conference on Distributed Computing Systems
A comparison of software and hardware techniques for x86 virtualization
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Problem diagnosis in large-scale computing environments
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Automated known problem diagnosis with event traces
Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
Magpie: online modelling and performance-aware systems
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
Path-based faliure and evolution management
NSDI'04 Proceedings of the 1st conference on Symposium on Networked Systems Design and Implementation - Volume 1
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
DIGITAL FX!32 running 32-bit ×86 applications on alpha NT
NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
Pip: detecting the unexpected in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Whodunit: transactional profiling for multi-tier applications
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Triage: diagnosing production run failures at the user's site
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Causeway: support for controlling and analyzing the execution of multi-tier applications
Proceedings of the ACM/IFIP/USENIX 2005 International Conference on Middleware
Detecting application-level failures in component-based Internet services
IEEE Transactions on Neural Networks
Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware
A generic solution for agile run-time inspection middleware
Middleware'11 Proceedings of the 12th ACM/IFIP/USENIX international conference on Middleware
A multi-level monitoring framework for stream-based coordination programs
ICA3PP'12 Proceedings of the 12th international conference on Algorithms and Architectures for Parallel Processing - Volume Part I
Automated root cause isolation of performance regressions during software development
Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering
Hi-index | 0.00 |
We present a three-part approach for diagnosing bugs and performance problems in production distributed environments. First, we introduce a novel execution monitoring technique that dynamically injects a fragment of code, the agent, into an application process on demand. The agent inserts instrumentation ahead of the control flow within the process and propagates into other processes, following communication events, crossing host boundaries, and collecting a distributed function-level trace of the execution. Second, we present an algorithm that separates the trace into user-meaningful activities called flows. This step simplifies manual examination and enables automated analysis of the trace. Finally, we describe our automated root cause analysis technique that compares the flows to help the analyst locate an anomalous flow and identify a function in that flow that is a likely cause of the anomaly. We demonstrate the effectiveness of our techniques by diagnosing two complex problems in the Condor distributed scheduling system.