A survey of online failure prediction methods

Authors:
Felix Salfner;Maren Lenk;Miroslaw Malek
Affiliations:
Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany;Humboldt-Universität zu Berlin, Germany
Venue:
ACM Computing Surveys (CSUR)
Year:
2010

Abstract

With the ever-growing complexity and dynamicity of computer systems, proactive fault management is an effective approach to enhancing availability, and online failure prediction is the key to such techniques. In contrast to classical reliability methods, online failure prediction is based on runtime monitoring and on a variety of models and methods that use the current state of a system and, frequently, past experience as well. This survey describes these methods. To capture the wide spectrum of approaches in this area, a taxonomy has been developed; the approaches it classifies are explained, and the major concepts are described in detail.