Measurement and modeling of computer reliability as affected by system activity
ACM Transactions on Computer Systems (TOCS)
On the Reliability of the IBM MVS/XA Operating System
IEEE Transactions on Software Engineering
Analysis and implementation of software rejuvenation in cluster systems
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Improving cluster availability using workstation validation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
The Vision of Autonomic Computing
Computer
Predicting Rare Events In Temporal Domains
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Critical event prediction for proactive management in large-scale computer clusters
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Failure Data Analysis of a Large-Scale Heterogeneous Server Environment
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
The dawning of the autonomic computing era
IBM Systems Journal
Filtering Failure Logs for a BlueGene/L Prototype
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
Fast and transparent recovery for continuous availability of cluster-based servers
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Availability Modeling and Analysis on High Performance Cluster Computing Systems
ARES '06 Proceedings of the First International Conference on Availability, Reliability and Security
A large-scale study of failures in high-performance computing systems
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
BlueGene/L Failure Analysis and Prediction Models
DSN '06 Proceedings of the International Conference on Dependable Systems and Networks
Decentralized Local Failure Detection in Dynamic Distributed Systems
SRDS '06 Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems
An SNMP based failure detection service
SRDS '06 Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
FUSE: lightweight guaranteed distributed failure notification
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
BlackJack: Hard Error Detection with Redundant Threads on SMT
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
Exploiting availability prediction in distributed systems
NSDI'06 Proceedings of the 3rd conference on Networked Systems Design & Implementation - Volume 3
Quantifying Temporal and Spatial Correlation of Failure Events for Proactive Management
SRDS '07 Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems
Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Proceedings of the 2009 workshop on Resiliency in high performance
CCGRID '09 Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid
Troubleshooting thousands of jobs on production grids using data mining techniques
GRID '08 Proceedings of the 2008 9th IEEE/ACM International Conference on Grid Computing
Supporting fault-tolerance for time-critical events in distributed environments
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Journal of Parallel and Distributed Computing
A study of dynamic meta-learning for failure prediction in large-scale systems
Journal of Parallel and Distributed Computing
Supporting fault-tolerance for time-critical events in distributed environments
Scientific Programming
Towards pro-active adaptation with confidence: augmenting service monitoring with online testing
Proceedings of the 2010 ICSE Workshop on Software Engineering for Adaptive and Self-Managing Systems
Quantifying event correlations for proactive failure management in networked computing systems
Journal of Parallel and Distributed Computing
Predicting computer system failures using support vector machines
WASL'08 Proceedings of the First USENIX conference on Analysis of system logs
Online event correlations analysis in system logs of large-scale cluster systems
NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Modelling pilot-job applications on production grids
Euro-Par'09 Proceedings of the 2009 international conference on Parallel processing
Randomized load balancing strategies with churn resilience in peer-to-peer networks
Journal of Network and Computer Applications
Failure-aware workflow scheduling in cluster environments
Cluster Computing
Vrisha: using scaling properties of parallel programs for bug detection and localization
Proceedings of the 20th international symposium on High performance distributed computing
Towards proactive event-driven computing
Proceedings of the 5th ACM international conference on Distributed event-based system
Event log mining tool for large scale HPC systems
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
A model of pilot-job resource provisioning on production grids
Parallel Computing
Job failures in high performance computing systems: A large-scale empirical study
Computers & Mathematics with Applications
Failure prediction and localization in large scientific workflows
Proceedings of the 6th workshop on Workflows in support of large-scale science
Online workflow management and performance analysis with stampede
Proceedings of the 7th International Conference on Network and Services Management
Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering
A decentralized approach for mining event correlations in distributed system monitoring
Journal of Parallel and Distributed Computing
Proceedings of the 7th ACM international conference on Distributed event-based systems
Failure analysis of distributed scientific workflows executing in the cloud
Proceedings of the 8th International Conference on Network and Service Management
A job submission manager for large-scale distributed systems based on job futurity predictor
International Journal of Grid and Utility Computing
Hi-index | 0.00 |
In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation to discover more correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. We implemented a failure prediction framework, called PREdictor of Failure Events Correlated Temporal-Spatially (hPREFECTs), which explores correlations among failures and forecasts the time-between-failure of future instances. We evaluate the performance of hPREFECTs in both offline prediction of failure by using the Los Alamos HPC traces and online prediction in an institute-wide clusters coalition environment. Experimental results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction during the time from May 2006 to April 2007.