Log summarization and anomaly detection for troubleshooting distributed systems

  • Authors:
  • Dan Gunter; Brian L. Tierney; Aaron Brown; Martin Swany; John Bresnahan; Jennifer M. Schopf

  • Affiliations:
  • Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of Delaware, Newark, USA; University of Delaware, Newark, USA; Argonne National Laboratory, Argonne, IL, USA; Argonne National Laboratory, Argonne, IL, USA

  • Venue:
  • GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
  • Year:
  • 2007

Abstract

Today’s system monitoring tools can detect system failures such as host failures, OS errors, and network partitions in near-real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. A single action, such as reliably transferring a directory of files, can involve many complex, interrelated steps across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding retry policy decisions, and so on. We present an infrastructure for troubleshooting complex middleware, a general-purpose technique for configurable log summarization, and an anomaly detection technique that works in near-real time on running Grid middleware. We present results gathered with this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze how accurately several algorithms detect a variety of performance anomalies.