An Experimental Evaluation of Coordinated Checkpointing in a Parallel Machine

Authors:
Luís Moura Silva;João Gabriel Silva
Venue:
EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
Year:
1999

Abstract

Coordinated checkpointing is a very effective way to ensure that distributed and parallel applications can continue when failures occur. Previous studies have shown that this approach achieves better results than independent checkpointing and message logging. However, we need to know more about the real overhead of coordinated checkpointing and gain well-founded insights into the best way to implement this fault-tolerance technique. This paper presents an experimental evaluation of coordinated checkpointing in a parallel machine. It describes some optimization techniques and presents some performance results.