Enabling comprehensive data-driven system management for large computational facilities
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
The scale and complexity of the hardware and software in large open source software systems such as Ranger make faults and failures inevitable. It is not inevitable, however, that they go undetected, or that diagnosis and recovery remain largely manual and effort-intensive. This paper presents a framework for end-to-end fault management for open source clusters. The framework is being developed on Ranger but targets clusters based on open source software in general. Its elements are a rationalized system logging stack for Linux, low-overhead log and status monitoring, and a multilevel suite of diagnostic analyses. The paper describes the framework, reports the accomplishments to date and the results obtained with the elements already in place, and outlines plans for future development, including a solicitation for collaboration on the project.