Architecting dependable systems with proactive fault management

Authors:
Felix Salfner; Miroslaw Malek
Affiliations:
Humboldt-Universität zu Berlin, Institut für Informatik, Berlin, Germany; Humboldt-Universität zu Berlin, Institut für Informatik, Berlin, Germany
Venue:
Architecting dependable systems VII
Year:
2010

Abstract

Managing the ever-growing complexity of computing systems is a persistent challenge for computer system engineers. We argue that predictive technologies are needed to master this complexity and to turn the vision of proactive system and failure management into reality. We describe proactive fault management, provide an overview and taxonomy of online failure prediction methods, and present a classification of the methods that are triggered by failure prediction. We present a model for assessing the effects of proactive fault management on system reliability and show that overall dependability can be enhanced significantly. Having shown the methods and potential of proactive fault management, we describe a blueprint for incorporating it into a dependable system's architecture.