Log summarization and anomaly detection for troubleshooting distributed systems

Authors:
Dan Gunter (Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
Brian L. Tierney (Lawrence Berkeley National Laboratory, Berkeley, CA, USA)
Aaron Brown (University of Delaware, Newark, USA)
Martin Swany (University of Delaware, Newark, USA)
John Bresnahan (Argonne National Laboratory, Argonne, IL, USA)
Jennifer M. Schopf (Argonne National Laboratory, Argonne, IL, USA)
Venue:
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Year:
2007

Abstract

Today’s system monitoring tools can detect system failures such as host failures, OS errors, and network partitions in near real time. Unfortunately, the same cannot yet be said of the end-to-end distributed software stack. A single action, such as reliably transferring a directory of files, can involve a wide range of complex and interrelated actions across multiple pieces of software: checking user certificates and permissions, getting details for all files, performing third-party transfers, understanding retry policy decisions, and so on. We present an infrastructure for troubleshooting complex middleware, a general-purpose technique for configurable log summarization, and an anomaly detection technique that works in near real time on running Grid middleware. We present results gathered using this infrastructure from instrumented Grid middleware and applications running on the Emulab testbed. From these results, we analyze how effectively several algorithms detect a variety of performance anomalies.