Implementing rollback-recovery coordinated checkpoints

Authors:
Clairton Buligon;Sérgio Cechin;Ingrid Jansch-Pôrto
Affiliations:
Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil;Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil;Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil
Venue:
ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
Year:
2005

Abstract

Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate their operation so that the set of local checkpoints taken by the individual processes forms a consistent global checkpoint (a recovery line). This allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents the results of implementing a distributed rollback-recovery algorithm that is transparent (it places no special requirements on applications) and coordinated (non-blocking). Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, the rollback procedure can be executed quickly, since a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm imposes very low performance overhead (less than 2%), with a strong dependence on checkpoint size. We are now studying implementation optimizations to reduce checkpoint latency.
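To make the notion of a consistent global checkpoint concrete, the following minimal sketch tests whether a set of local checkpoints forms a recovery line. It is an illustration only and is not taken from the paper: the checkpoint layout (per-peer `sent` and `received` message counters) and the function name `is_consistent` are assumptions introduced here. It relies on the standard definition that a global state is consistent when no message is recorded as received without also being recorded as sent.

```python
from typing import Dict, List

# Hypothetical layout (an assumption, not the paper's format): each local
# checkpoint stores, per peer index, how many messages the process had
# sent to and received from that peer when the checkpoint was taken.
Checkpoint = Dict[str, List[int]]


def is_consistent(checkpoints: List[Checkpoint]) -> bool:
    """Return True if the local checkpoints form a consistent global state.

    The set is inconsistent if some process j recorded more messages
    received from process i than i recorded as sent to j (an orphan
    message). Messages sent but not yet received (in transit) are allowed.
    """
    n = len(checkpoints)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sent_i_to_j = checkpoints[i]["sent"][j]
            received_j_from_i = checkpoints[j]["received"][i]
            if received_j_from_i > sent_i_to_j:
                return False  # orphan message: received but never sent
    return True


# Two processes, P0 and P1.
recovery_line = [
    {"sent": [0, 2], "received": [0, 1]},  # P0
    {"sent": [1, 0], "received": [2, 0]},  # P1
]
orphaned = [
    {"sent": [0, 1], "received": [0, 0]},  # P0 recorded 1 message sent to P1
    {"sent": [0, 0], "received": [2, 0]},  # P1 recorded 2 received from P0
]
print(is_consistent(recovery_line))
print(is_consistent(orphaned))
```

In a coordinated protocol such as the one the abstract describes, the coordination itself is meant to guarantee this property at checkpoint time, so a check like this would serve mainly as a test or debugging aid rather than as part of the recovery path.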