Networked computing systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures are the norm rather than the exception. Moreover, failure events exhibit strong correlations in both the time and space domains. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify temporal correlation and a stochastic model to characterize spatial correlation. The models are further extended to take application allocation information into account, which reveals additional correlations among failure instances. We cluster failure events based on their correlations and predict their future occurrences. Experimental results on a production coalition system, the Wayne State Computational Grid, show that the offline and online predictions made by our prediction system can forecast 72.7% to 85.3% of the failure occurrences and capture failure correlations in a cluster coalition environment.
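As a rough illustration of the temporal part, the sketch below uses the textbook spherical covariance function from geostatistics, with the range parameter `theta` standing in for the adjustable timescale. It is an assumption that this matches the paper's formulation, which the abstract does not spell out. The function names, the `threshold` parameter, and the greedy grouping rule are hypothetical and are not the authors' clustering method.

```python
def spherical_covariance(dt, theta):
    """Textbook spherical covariance, normalized to 1.0 at dt == 0.

    dt    -- time gap between two failure events (any consistent unit)
    theta -- timescale (range); events further apart than theta are
             treated as uncorrelated. Assumed to play the role of the
             abstract's "adjustable timescale parameter".
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    d = abs(dt)
    if d >= theta:
        return 0.0
    r = d / theta
    return 1.0 - 1.5 * r + 0.5 * r ** 3


def group_by_temporal_correlation(timestamps, theta, threshold):
    """Hypothetical greedy grouping, for illustration only.

    An event joins the current group while its covariance with the
    previous event is at least `threshold`; otherwise it starts a new group.
    """
    groups = []
    for t in sorted(timestamps):
        if groups and spherical_covariance(t - groups[-1][-1], theta) >= threshold:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


if __name__ == "__main__":
    # Made-up failure timestamps (e.g. minutes), not data from the paper.
    events = [0, 1, 2, 50, 51]
    print(group_by_temporal_correlation(events, theta=10, threshold=0.5))
```

Raising `theta` makes more distant events count as correlated, so groups merge; lowering it splits them, which is one way to read the role of an adjustable timescale.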