End-to-end framework for fault management for open source clusters: Ranger

Authors:
John L. Hammond; Tommy Minyard; Jim Browne
Affiliations:
ICES, University of Texas, Austin, Texas; TACC, University of Texas, Austin, Texas; University of Texas, Austin, Texas
Venue:
Proceedings of the 2010 TeraGrid Conference
Year:
2010

Abstract

The scale and complexity of both hardware and software on large open source software systems such as Ranger make the occurrence of faults and failures inevitable. What is not inevitable is that they go undetected, or that diagnosis of and recovery from failures remain largely manual and effort-intensive. This paper presents a framework for end-to-end fault management that is being developed on Ranger but targets clusters based on open source software in general. The framework has three elements: a rationalized system logging stack for Linux, low-overhead log and status monitoring, and a multilevel suite of diagnostic analyses. The paper describes the framework, reports the accomplishments to date and the results obtained with the elements already in place, and outlines plans for future development, including a solicitation for collaboration on the project.