Visual and algorithmic tooling for system trace analysis: a case study

  • Authors: Wim De Pauw; Stephen Heisig
  • Affiliations: IBM T.J. Watson Research Center, Hawthorne, NY (both authors)
  • Venue: ACM SIGOPS Operating Systems Review
  • Year: 2010

Abstract

Despite advances in the application of automated statistical and machine learning techniques to system log and trace data, there will always be a need for human analysis of machine traces, because trace information on unstable systems may be incomplete or incorrect. In addition, false positives from automated analysis are unlikely to disappear, and remediation measures and candidate fix tests will need to be evaluated. We present Zinsight, a visual and analytic tool that supports performance analysis and debugging, using large event traces to understand complex systems. The tool enables analysts to quickly create and manipulate high-level structural representations linked with statistical analysis derived from the underlying event trace data. The original raw trace is annotated with module names, and a domain-specific database is incorporated to relate software functions to module names. Navigable sequence context graph views present execution flow patterns extracted automatically from arbitrarily definable sets of events, and are linked to frequency, distribution, and response time views. The goal is to reduce the cognitive and computational load on the analyst while providing answers to the most natural questions in a problem determination session. We present a case study of the tool in use on field problems from the recently shipped (late 2008) IBM z10 mainframe. As a result of the industry trend toward higher parallelism and memory latency, many issues were encountered with legacy code. The tool was applied successfully to diagnose these problems.
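
As a rough illustration of extracting flow patterns from a user-defined set of events and linking them to frequencies, the following minimal sketch counts transitions between consecutive selected events in a trace. It is not Zinsight's algorithm, which the abstract does not specify; the function name `transition_counts` and the event names are hypothetical.

```python
from collections import Counter


def transition_counts(events, selected):
    """Count transitions between consecutive events that belong to `selected`.

    `events` is an ordered list of event names (a hypothetical trace format);
    `selected` is the analyst-defined set of event names of interest.
    Events outside `selected` are skipped before pairing.
    """
    filtered = [name for name in events if name in selected]
    # Pair each retained event with its successor and tally the pairs.
    return Counter(zip(filtered, filtered[1:]))


# Hypothetical trace; "log" is excluded from the selected event set.
trace = ["open", "read", "log", "read", "close", "open", "read", "close"]
counts = transition_counts(trace, {"open", "read", "close"})

for (src, dst), n in sorted(counts.items()):
    print(f"{src} -> {dst}: {n}")
```

Each counted pair corresponds to an edge in a simple flow graph, and the count serves as that edge's frequency. A real tool would also carry timestamps so that edges could be linked to response time distributions.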