Finding a suitable checkpoint and recovery protocol for a distributed application

Authors:
Himadri Sekhar Paul; Arobinda Gupta; Amit Sharma
Affiliations:
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati, Assam, India; Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur 721302, West Bengal, India; Department of Computer Science, University of Illinois at Urbana-Champaign, Illinois, USA
Venue:
Journal of Parallel and Distributed Computing - Special issue: 18th International Parallel and Distributed Processing Symposium
Year:
2006



Abstract

Checkpoint and recovery protocols are commonly used to provide fault tolerance in distributed applications. The performance of such a protocol is judged by the amount of computation it saves against the overhead it incurs. This performance depends on system and application characteristics as well as protocol-specific parameters. Hence, no single checkpoint and recovery protocol works equally well for all applications, and for a given distributed application and the system it will run on, it is important to choose the protocol that performs best for that combination. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique for measuring the similarity between the communication patterns of two distributed applications, which is also of independent interest. The similarity measure is based on a graph similarity problem, for which we present a heuristic. Extensive experimental results for both the graph similarity heuristic and the automatic identification scheme show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.
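
The general idea in the abstract, matching a new application's communication pattern against applications whose best protocol is already known, can be illustrated with a minimal sketch. The abstract does not describe the paper's graph similarity heuristic, so the measure below is a deliberately crude stand-in that compares per-process traffic shares and ignores graph structure. The function names, the dictionary encoding of a pattern, and the protocol labels are all assumptions made for illustration.

```python
from collections import defaultdict

# A communication pattern is assumed to be a dict mapping a
# (sender, receiver) pair to a message count. This encoding is hypothetical.

def traffic_profile(pattern):
    """Return each process's share of total traffic, largest first."""
    volume = defaultdict(float)
    for (src, dst), count in pattern.items():
        volume[src] += count
        volume[dst] += count
    total = sum(volume.values())
    return sorted((v / total for v in volume.values()), reverse=True)

def similarity(pattern_a, pattern_b):
    """Stand-in measure (NOT the paper's heuristic): 1.0 for identical
    traffic profiles, decreasing towards 0.0 as the profiles diverge.
    Both patterns are assumed to be non-empty."""
    a = traffic_profile(pattern_a)
    b = traffic_profile(pattern_b)
    size = max(len(a), len(b))
    a += [0.0] * (size - len(a))  # pad so both profiles have equal length
    b += [0.0] * (size - len(b))
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(a, b))

def suggest_protocol(new_pattern, known):
    """known is a list of (pattern, protocol_label) pairs.
    Return the label of the most similar known pattern and its score."""
    best_pattern, best_label = max(
        known, key=lambda pair: similarity(new_pattern, pair[0])
    )
    return best_label, similarity(new_pattern, best_pattern)

# Two reference applications with hypothetical protocol labels.
ring = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}
star = {(0, 1): 10, (0, 2): 10, (0, 3): 10}
known = [(ring, "coordinated"), (star, "message-logging")]

# A new application that also communicates in a ring, with lighter traffic.
new_app = {(0, 1): 5, (1, 2): 5, (2, 3): 5, (3, 0): 5}
label, score = suggest_protocol(new_app, known)
print(label, round(score, 2))
```

Because the profile is normalised and sorted, the sketch is insensitive to process numbering and absolute message volume, but two structurally different graphs can still share a profile. A measure based on graph similarity, as the abstract describes, would distinguish such cases.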