Implementing rollback-recovery coordinated checkpoints

  • Authors:
  • Clairton Buligon;Sérgio Cechin;Ingrid Jansch-Pôrto

  • Affiliations:
  • Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil;Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil;Graduate Program in Computer Science, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, RS, Brazil

  • Venue:
  • ISSADS'05 Proceedings of the 5th international conference on Advanced Distributed Systems
  • Year:
  • 2005

Abstract

Recovering from processor failures is an important problem in the design of reliable distributed systems. Processes must coordinate their operation to guarantee that the set of local checkpoints taken by the individual processes forms a consistent global checkpoint (a recovery line). This allows the system to resume operation from a consistent global state when recovering from a failure. This paper presents the results of implementing a distributed rollback-recovery algorithm that is transparent (it places no special requirements on applications) and coordinated (non-blocking). Because it does not block applications, overhead during failure-free operation is reduced. Furthermore, the rollback procedure can be executed quickly because a recovery line is always available and well identified. Our preliminary experimental results show that the algorithm imposes very low performance overhead (less than 2%), with a strong dependence on the checkpoint size. We are currently studying implementation optimizations to reduce checkpoint latency.
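
To make the notion of a recovery line concrete, the following minimal Python sketch checks the standard consistency condition for a set of local checkpoints: no message may be recorded as received by one process unless its send is also recorded by the sender (no orphan messages). This is an illustration of the general definition only, not the algorithm implemented in the paper. The names `Message`, `orphan_messages`, and `is_recovery_line`, and the representation of a checkpoint as a count of local events, are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_event: int   # index of the send event in the sender's local history
    receiver: str
    recv_event: int   # index of the receive event in the receiver's local history


def orphan_messages(cut, messages):
    """Return messages whose receive is inside the cut but whose send is not.

    `cut` maps each process name to the number of local events covered by
    its checkpoint (events with index < cut[p] are saved). Hypothetical
    representation chosen for this sketch.
    """
    return [
        m for m in messages
        if m.recv_event < cut[m.receiver] and m.send_event >= cut[m.sender]
    ]


def is_recovery_line(cut, messages):
    """A set of local checkpoints is consistent if it has no orphan messages."""
    return not orphan_messages(cut, messages)


# One message: P0 sends at its event 2, P1 receives at its event 1.
msgs = [Message("P0", 2, "P1", 1)]

consistent = is_recovery_line({"P0": 3, "P1": 2}, msgs)    # send and receive both saved
in_transit = is_recovery_line({"P0": 3, "P1": 1}, msgs)    # send saved, receive not yet
inconsistent = is_recovery_line({"P0": 2, "P1": 2}, msgs)  # receive saved, send lost
```

In this sketch a message that is sent but not yet received at checkpoint time (in transit) does not violate consistency, whereas a received message with no recorded send does. A coordinated protocol aims to ensure that only consistent sets of checkpoints are ever committed.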