TraceBack: first fault diagnosis by reconstruction of distributed control flow

  • Authors:
  • Andrew Ayers;Richard Schooler;Chris Metcalf;Anant Agarwal;Junghwan Rhee;Emmett Witchel

  • Affiliations:
  • Microsoft Corporation;Microsoft Corporation;VERITAS Software / MIT CSAIL;VERITAS Software / MIT CSAIL;University of Texas at Austin;University of Texas at Austin

  • Venue:
  • Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
  • Year:
  • 2005

Abstract

Faults that occur in production systems are the most important faults to fix, but most production systems lack the debugging facilities present in development environments. TraceBack provides debugging information for production systems by supplying execution history data about program problems (such as crashes, hangs, and exceptions). TraceBack supports features commonly found in production environments, such as multiple threads, dynamically loaded modules, multiple source languages (e.g., Java applications running with JNI modules written in C++), and distributed execution across multiple computers. TraceBack supports first fault diagnosis: discovering what went wrong the first time a fault is encountered. The user can see how the program reached the fault state without having to re-run the computation, in effect enabling a limited form of debugger in production code.

TraceBack uses static, binary program analysis to inject low-overhead runtime instrumentation at control-flow block granularity. Post-facto reconstruction of the records written by the instrumentation code produces a source-statement trace for user diagnosis. The trace shows the dynamic instruction sequence leading up to the fault state, even when the program took exceptions or terminated abruptly (e.g., kill -9).

We have implemented TraceBack on a variety of architectures and operating systems, and present examples from a variety of platforms. Performance overhead varies with the workload: 5% for Apache running SPECweb99, 16%-25% for the Java SPECJbb benchmark, and 60% on average for SPECint2000. We show examples of TraceBack's cross-language and cross-machine abilities, and report its use in diagnosing problems in production software.
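As a rough illustration of block-granularity tracing with post-facto reconstruction, the sketch below records block identifiers in a fixed-size buffer and later expands them into source statements. It is not TraceBack's actual record format or algorithm: the names `TraceBuffer`, `probe`, `reconstruct`, and `BLOCK_TO_LINES`, the file names, and the block-to-line map are all hypothetical, and a real system would work on binaries and per-thread buffers rather than Python objects.

```python
from collections import deque

# Hypothetical static map, assumed to be produced when the binary is
# instrumented: block id -> source statements covered by that block.
BLOCK_TO_LINES = {
    1: ["main.c:10", "main.c:11"],
    2: ["main.c:13"],
    3: ["util.c:42", "util.c:43"],
}


class TraceBuffer:
    """Fixed-size buffer of block ids; the oldest records are overwritten."""

    def __init__(self, capacity):
        self.records = deque(maxlen=capacity)

    def probe(self, block_id):
        # Stand-in for what an injected probe might do on entry to a block.
        self.records.append(block_id)


def reconstruct(buffer, block_map):
    """Expand recorded block ids into a source-statement trace, oldest first."""
    trace = []
    for block_id in buffer.records:
        trace.extend(block_map.get(block_id, [f"<unknown block {block_id}>"]))
    return trace


# Simulated run: blocks 1, 2, 1, 3 execute, then the program faults.
buf = TraceBuffer(capacity=3)
for executed_block in [1, 2, 1, 3]:
    buf.probe(executed_block)

for statement in reconstruct(buf, BLOCK_TO_LINES):
    print(statement)
```

Because the buffer holds only three records, the first execution of block 1 is overwritten, and the reconstructed trace covers just the most recent history before the fault. A bounded buffer of this kind is one plausible way to keep recording cost low, at the price of a limited history window.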