Improving the accuracy of data race detection
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
Eraser: a dynamic data race detector for multi-threaded programs
Proceedings of the sixteenth ACM symposium on Operating systems principles
OCM—a monitoring system for interoperable tools
SPDT '98 Proceedings of the SIGMETRICS symposium on Parallel and distributed tools
Efficient algorithms for mining outliers from large data sets
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Finding failures by cluster analysis of execution profiles
ICSE '01 Proceedings of the 23rd International Conference on Software Engineering
Pinpoint: Problem Determination in Large, Dynamic Internet Services
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
A Sense of Self for Unix Processes
SP '96 Proceedings of the 1996 IEEE Symposium on Security and Privacy
Intrusion Detection via Static Analysis
SP '01 Proceedings of the 2001 IEEE Symposium on Security and Privacy
Mining distance-based outliers in near linear time with randomization and a simple pruning rule
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Scalable statistical bug isolation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Failure Diagnosis Using Decision Trees
ICAC '04 Proceedings of the First International Conference on Autonomic Computing
Magpie: online modelling and performance-aware systems
HOTOS'03 Proceedings of the 9th conference on Hot Topics in Operating Systems - Volume 9
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
Using magpie for request extraction and workload modelling
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
D3S: debugging deployed distributed systems
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Grid Application Fault Diagnosis Using Wrapper Services and Machine Learning
ICSOC '07 Proceedings of the 5th international conference on Service-Oriented Computing
On-Line Performance Modeling for MPI Applications
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Adaptive Monitoring with Dynamic Differential Tracing-Based Diagnosis
DSOM '08 Proceedings of the 19th IFIP/IEEE international workshop on Distributed Systems: Operations and Management: Managing Large-Scale Service Deployment
Diagnosing distributed systems with self-propelled instrumentation
Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware
Log summarization and anomaly detection for troubleshooting distributed systems
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
Elicitation and utilization of application-level utility functions
ICAC '09 Proceedings of the 6th international conference on Autonomic computing
AVA: automated interpretation of dynamically detected anomalies
Proceedings of the eighteenth international symposium on Software testing and analysis
Monitoring MPI programs for performance characterization and management control
Proceedings of the 2010 ACM Symposium on Applied Computing
Black-box problem diagnosis in parallel file systems
FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Mining invariants from console logs for system problem detection
USENIXATC'10 Proceedings of the 2010 USENIX conference on USENIX annual technical conference
FlowChecker: Detecting Bugs in MPI Libraries via Message Flow Checking
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Behavior-based problem localization for parallel file systems
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Vrisha: using scaling properties of parallel programs for bug detection and localization
Proceedings of the 20th international symposium on High performance distributed computing
Large scale debugging of parallel tasks with AutomaDeD
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Probabilistic diagnosis of performance faults in large-scale parallel applications
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
3-Dimensional root cause diagnosis via co-analysis
Proceedings of the 9th international conference on Autonomic computing
Automated tracing and visualization of software security structure and properties
Proceedings of the Ninth International Symposium on Visualization for Cyber Security
ABHRANTA: locating bugs that manifest at large system scales
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
WuKong: automatically detecting and localizing bugs that manifest at large system scales
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Hi-index | 0.00 |
We describe a new approach for locating the causes of anomalies in distributed systems. Our target environment is a distributed application that contains multiple identical processes performing similar activities. We use a new, lightweight form of dynamic instrumentation to collect function-level traces from each process. If the application fails, the traces are automatically compared to each other. We find anomalies by identifying processes that stopped earlier than the rest (sign of a fail-stop problem) or processes that behaved different from the rest (sign of a non-fail-stop problem). Our algorithm does not require reference data to distinguish anomalies from normal behaviors. However, it can make use of such data when available to reduce the number of false positives. Ultimately, we identify a function that is likely to explain the anomalous behavior. We demonstrated the efficacy of our approach by finding two problems in a large distributed cluster environment called SCore.