Communication Pattern Based Checkpointing Coordination for Fault-Tolerant Distributed Computing Systems

  • Authors:
  • Taesoon Park; Heon Y. Yeom

  • Venue:
  • ICOIN '98 Proceedings of the 13th International Conference on Information Networking
  • Year:
  • 1998

Abstract

This paper presents a new checkpointing coordination scheme that exploits information about the communication pattern of the target program. We classify the communication patterns of the processes and find that, in most cases, the dependency relation that can cause cascading rollbacks, known as the domino effect, involves only two processes. For such cases, we propose a cycle detection scheme to prevent the domino effect. Even in the remaining cases, only a limited number of processes are typically involved in the domino effect. Hence, we also propose a limited coordination scheme in which coordination involves only the processes specified in the communication pattern. Exploiting the communication pattern of the target program removes unnecessary coordination effort and also reduces the checkpointing frequency. One possible drawback of the proposed scheme is that the rollback distance may be longer in some cases. However, the difference is minimal, and we believe it is a small price to pay at failure time compared with the reduced overhead during normal execution. Extensive simulation was performed to evaluate the proposed scheme, and we conclude that it significantly reduces the checkpointing overhead compared with loose coordination schemes.
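
The abstract does not give the algorithms themselves, so the following is only a minimal illustrative sketch of the two ideas it names, not the authors' method. It assumes the communication pattern is available as a list of (sender, receiver) pairs. The function names `two_process_cycles` and `coordination_groups` are hypothetical, as is the simplification that a two-process dependency cycle corresponds to a pair of processes that send messages to each other.

```python
from collections import defaultdict


def two_process_cycles(messages):
    """Return the pairs of processes that send messages in both directions.

    Assumption: mutual communication is used here as a stand-in for the
    two-process dependency cycles mentioned in the abstract.
    """
    edges = set(messages)
    return {frozenset((s, r)) for s, r in edges if s != r and (r, s) in edges}


def coordination_groups(processes, messages):
    """Partition processes into groups linked, directly or transitively,
    by the communication pattern. Only members of the same group would
    need to coordinate their checkpoints under this simplified model.
    """
    neighbours = defaultdict(set)
    for sender, receiver in messages:
        neighbours[sender].add(receiver)
        neighbours[receiver].add(sender)

    seen, groups = set(), []
    for p in processes:
        if p in seen:
            continue
        # Depth-first search over the undirected communication graph.
        group, stack = set(), [p]
        while stack:
            q = stack.pop()
            if q in group:
                continue
            group.add(q)
            stack.extend(neighbours[q] - group)
        seen |= group
        groups.append(group)
    return groups


# Hypothetical pattern: P0 and P1 exchange messages, P2 only sends to P3,
# and P4 communicates with nobody.
processes = ["P0", "P1", "P2", "P3", "P4"]
messages = [("P0", "P1"), ("P1", "P0"), ("P2", "P3")]

cycles = two_process_cycles(messages)
groups = coordination_groups(processes, messages)
```

In this example only P0 and P1 form a two-process cycle, and P4 ends up in a group of its own, so it would never need to take part in coordination. This mirrors the abstract's point that restricting coordination to the processes named in the communication pattern avoids unnecessary effort.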