Problem diagnosis in large-scale computing environments

Authors:
Alexander V. Mirgorodskiy;Naoya Maruyama;Barton P. Miller
Affiliations:
VMware, Inc.;Tokyo Institute of Technology;University of Wisconsin
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006



Abstract

We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (a sign of a fail-stop problem) or processes that behaved differently from the rest (a sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behaviors. However, it can make use of such data, when available, to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.
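
The comparison step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the trace format (a list of (timestamp, function) entry events per process), the L1 distance between normalized time profiles, the k-th-nearest-neighbor suspect score, the gap threshold, and every function name below are assumptions chosen only to make the idea concrete.

```python
from collections import Counter

# Assumed trace format: {pid: [(timestamp, function_name), ...]}, sorted by time.
# Each event marks entry into a function; time until the next event is charged to it.

def earliest_stopper(traces, gap_factor=3.0):
    """Fail-stop check: return the pid whose trace ends markedly earlier
    than all others, or None. gap_factor is a hypothetical threshold."""
    last = {pid: ev[-1][0] for pid, ev in traces.items() if ev}
    ordered = sorted(last, key=last.get)
    if len(ordered) < 3:
        return None
    first, rest = ordered[0], ordered[1:]
    spread = last[rest[-1]] - last[rest[0]]   # variation among the other processes
    gap = last[rest[0]] - last[first]         # how much earlier the first one stopped
    return first if gap > gap_factor * max(spread, 1e-9) else None

def profile(events):
    """Fraction of traced time spent in each function."""
    spent = Counter()
    for (t0, fn), (t1, _) in zip(events, events[1:]):
        spent[fn] += t1 - t0
    total = sum(spent.values()) or 1.0
    return {fn: t / total for fn, t in spent.items()}

def distance(p, q):
    """L1 distance between two time profiles."""
    return sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in set(p) | set(q))

def suspect_scores(profiles, k=1):
    """Score each process by the distance to its k-th nearest peer.
    No reference (known-good) data is needed: peers serve as the baseline.
    Requires at least two processes."""
    scores = {}
    for pid, p in profiles.items():
        dists = sorted(distance(p, q) for other, q in profiles.items() if other != pid)
        scores[pid] = dists[min(k, len(dists)) - 1]
    return scores

def blame_function(profiles, suspect):
    """Function whose time share differs most from the suspect's nearest peer."""
    p = profiles[suspect]
    peers = [q for pid, q in profiles.items() if pid != suspect]
    nearest = min(peers, key=lambda q: distance(p, q))
    return max(set(p) | set(nearest),
               key=lambda f: abs(p.get(f, 0.0) - nearest.get(f, 0.0)))

if __name__ == "__main__":
    normal = [(0, "main"), (1, "compute"), (7, "send"), (8, "recv"), (10, "exit")]
    odd = [(0, "main"), (1, "compute"), (4, "send"), (9, "recv"), (10, "exit")]
    traces = {"p0": normal, "p1": list(normal), "p2": odd}
    profiles = {pid: profile(ev) for pid, ev in traces.items()}
    scores = suspect_scores(profiles)
    suspect = max(scores, key=scores.get)
    print(earliest_stopper(traces), suspect, blame_function(profiles, suspect))
```

In this sketch, a process is compared only against its peers, which mirrors the abstract's point that no reference data is required. If traces from known-good runs were available, they could be added to the peer set so that unusual but legitimate behavior scores lower; how the paper actually incorporates such data is not stated in the abstract.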