3-Dimensional root cause diagnosis via co-analysis

Authors:
Ziming Zheng;Li Yu;Zhiling Lan;Terry Jones
Affiliations:
Illinois Institute of Technology, Chicago, IL, USA;Illinois Institute of Technology, Chicago, IL, USA;Illinois Institute of Technology, Chicago, IL, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA
Venue:
Proceedings of the 9th international conference on Autonomic computing
Year:
2012

Citing 25
Cited 0

Causality: models, reasoning, and inference

Causality: models, reasoning, and inference
Data mining: concepts and techniques

Data mining: concepts and techniques
Pinpoint: Problem Determination in Large, Dynamic Internet Services

DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
Filtering Failure Logs for a BlueGene/L Prototype

DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
A large-scale study of failures in high-performance computing systems

DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Problem diagnosis in large-scale computing environments

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Using magpie for request extraction and workload modelling

OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
What Supercomputers Say: A Study of Five System Logs

DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Analyzing system logs: a new view of what's important

SYSML'07 Proceedings of the 2nd USENIX workshop on Tackling computer systems problems with machine learning techniques
DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Bad Words: Finding Faults in Spirit's Syslogs

CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Early evaluation of IBM BlueGene/P

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Dustminer: troubleshooting interactive complexity bugs in sensor networks

Proceedings of the 6th ACM conference on Embedded network sensor systems
Petascale system management experiences

LISA'08 Proceedings of the 22nd conference on Large installation system administration conference
HOLMES: Effective statistical debugging via efficient path profiling

ICSE '09 Proceedings of the 31st International Conference on Software Engineering
Towards automated performance diagnosis in a large IPTV network

Proceedings of the ACM SIGCOMM 2009 conference on Data communication
Black-box problem diagnosis in parallel file systems

FAST'10 Proceedings of the 8th USENIX conference on File and storage technologies
Diagnosing performance changes by comparing request flows

Proceedings of the 8th USENIX conference on Networked systems design and implementation
Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

SLAML '11 Managing Large-scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques
Improving Log-based Field Failure Data Analysis of multi-node computing systems

DSN '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems&Networks
Identifying faults in large-scale distributed systems by filtering noisy error logs

DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops
Co-analysis of RAS Log and Job Log on Blue Gene/P

IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Detecting application-level failures in component-based Internet services

IEEE Transactions on Neural Networks
Toward Automated Anomaly Identification in Large-Scale Systems

IEEE Transactions on Parallel and Distributed Systems
Filtering log data: Finding the needles in the Haystack

DSN '12 Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Quantified Score

Hi-index	0.00

Visualization

Abstract

With the growth of system size and complexity, reliability has become a major concern for large-scale systems. Upon the occurrence of failure, system administrators typically trace the events in Reliability, Availability, and Serviceability (RAS) logs for root cause diagnosis. However, RAS log only contains limited diagnosis information. Moreover, the manual processing is time-consuming, error-prone, and not scalable. To address the problem, in this paper we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D fine-grained root cause analysis. Here, 3-D means that our analysis will pinpoint the failure layer, the time, and the location of the event that causes the problem. We evaluate our mechanism by means of real logs collected from a production IBM Blue Gene/P system at Oak Ridge National Laboratory. It successfully identifies failure layer information for the failures during 23-month period. Furthermore, it effectively identifies the triggering events with time and location information, even when the triggering events occur hundreds of hours before the resulting failures.