Finding a suitable checkpoint and recovery protocol for a distributed application

  • Authors:
  • Himadri Sekhar Paul; Arobinda Gupta; Amit Sharma

  • Affiliations:
  • Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati, Assam, India; Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, West Bengal, India; Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA

  • Venue:
  • Journal of Parallel and Distributed Computing - Special issue: 18th International Parallel and Distributed Processing Symposium
  • Year:
  • 2006

Abstract

Checkpoint and recovery protocols are commonly used to provide fault tolerance in distributed applications. The performance of such a protocol is judged by the amount of computation it saves against the overhead it incurs. This performance depends on system and application characteristics as well as on protocol-specific parameters. Hence, no single checkpoint and recovery protocol works equally well for all applications, and for a given distributed application and the system it will run on, it is important to choose the protocol that gives the best performance. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique, also of independent interest, for measuring the similarity between the communication patterns of two distributed applications. The similarity measure is based on a graph similarity problem, for which we present a heuristic. Extensive experimental results for both the graph similarity heuristic and the automatic identification scheme show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.