Managing the ever-growing complexity of computing systems is a perpetual challenge for computer system engineers. We argue that predictive technologies are needed to harness this complexity and turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of methods triggered by failure prediction. We present a model to assess the effects of proactive fault management on system reliability and show that overall dependability can be significantly enhanced. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system's architecture.