Checkpointing algorithms and fault prediction

Authors:
Guillaume Aupy; Yves Robert; Frédéric Vivien; Dounia Zaidouni
Venue:
Journal of Parallel and Distributed Computing
Year:
2014

References:
A first order approximation to the optimum checkpoint interval. Communications of the ACM.
Improving cluster availability using workstation validation. SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
Software Rejuvenation: Analysis, Module and Applications. FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing.
On the Choice of Checkpoint Interval Using Memory Usage Profile and Adaptive Time Series Analysis. PRDC '01 Proceedings of the 2001 Pacific Rim International Symposium on Dependable Computing.
Probability and Computing: Randomized Algorithms and Probabilistic Analysis.
A large-scale study of failures in high-performance computing systems. DSN '06 Proceedings of the International Conference on Dependable Systems and Networks.
Failure Prediction in IBM BlueGene/L Event Logs. ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining.
Fault-Aware Runtime Strategies for High-Performance Computing. IEEE Transactions on Parallel and Distributed Systems.
Proactive management of software aging. IBM Journal of Research and Development.
A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems.
The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems. CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
Predicting computer system failures using support vector machines. WASL'08 Proceedings of the First USENIX conference on Analysis of system logs.
A practical failure prediction with location and lead time for Blue Gene/P. DSNW '10 Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
Practical online failure prediction for Blue Gene/P: Period-based vs event-driven. DSNW '11 Proceedings of the 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops.
Checkpointing strategies for parallel jobs. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Evaluating the viability of process replication reliability for exascale systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Modeling and tolerating heterogeneous failures in large parallel systems. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems. IPDPS '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
Design and modeling of a non-blocking checkpointing system. SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
Fault prediction under the microscope: a closer look into HPC systems. SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.

Abstract

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide optimal algorithms to decide whether and when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale.
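
For readers unfamiliar with the quantities the abstract names, the following is a minimal Python sketch of the classical first-order result without any predictor, together with the standard definitions of precision and recall. It is not the paper's algorithm: the function names, the simplified waste model (checkpoint overhead plus expected re-execution, ignoring downtime and recovery), and the sample numbers are all illustrative assumptions.

```python
import math


def young_period(mtbf: float, checkpoint_cost: float) -> float:
    """Young's first-order optimal checkpointing period, sqrt(2 * MTBF * C)."""
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_waste(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """Fraction of time lost under a simplified first-order model (assumption):
    C / T for checkpoint overhead plus T / (2 * MTBF) for expected re-execution.
    Downtime and recovery costs are deliberately left out of this sketch."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Standard definitions: precision is the fraction of predictions that are
    real faults, recall is the fraction of real faults that were predicted."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


if __name__ == "__main__":
    mtbf = 7200.0  # hypothetical platform MTBF, in seconds
    cost = 100.0   # hypothetical checkpoint cost, in seconds
    period = young_period(mtbf, cost)
    print(f"period = {period:.0f} s, waste = {first_order_waste(period, mtbf, cost):.3f}")
    print(precision_recall(true_pos=60, false_pos=20, false_neg=40))
```

In this simplified model the waste is minimized exactly at the period returned by `young_period`. The paper's contribution, per the abstract, is to extend this kind of analysis to a predictor characterized by its recall and precision; that extension is not reproduced here.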