Checkpoint and recovery protocols are commonly used to provide fault tolerance in distributed applications. The performance of such a protocol is judged by the amount of computation it saves against the overhead it incurs. This performance depends on system and application characteristics as well as on protocol-specific parameters. Hence, no single checkpoint and recovery protocol works equally well for all applications: given a distributed application and the system it will run on, it is important to choose the protocol that gives the best performance for that combination. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique, also of independent interest, for measuring the similarity between the communication patterns of two distributed applications. The similarity measure is based on a graph similarity problem, for which we present a heuristic. Extensive experimental results for both the graph similarity heuristic and the automatic identification scheme show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.
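The abstract does not spell out the similarity heuristic, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a communication pattern can be modeled as a weighted directed graph of message counts, swaps in a simple label-independent degree-signature comparison (cosine similarity) as a stand-in similarity measure, and picks the protocol attached to the most similar known application. The function names, the `library` structure, and the protocol labels are all hypothetical.

```python
from collections import Counter
from math import sqrt


def comm_graph(messages):
    """Build a weighted directed graph {(sender, receiver): count}
    from a trace of (sender, receiver) message events."""
    return Counter(messages)


def degree_signature(graph):
    """Return the (out-weight, in-weight) pair of every process,
    sorted in descending order so the result ignores process labels."""
    out_w, in_w = Counter(), Counter()
    for (src, dst), weight in graph.items():
        out_w[src] += weight
        in_w[dst] += weight
    procs = set(out_w) | set(in_w)
    return sorted(((out_w[p], in_w[p]) for p in procs), reverse=True)


def similarity(g1, g2):
    """Stand-in similarity (assumption, not the paper's heuristic):
    cosine similarity of the flattened, zero-padded degree signatures."""
    s1, s2 = degree_signature(g1), degree_signature(g2)
    n = max(len(s1), len(s2))
    v1 = [x for pair in s1 for x in pair] + [0] * (2 * (n - len(s1)))
    v2 = [x for pair in s2 for x in pair] + [0] * (2 * (n - len(s2)))
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0


def select_protocol(target, library):
    """Return the protocol label of the library entry whose communication
    graph is most similar to `target`.
    `library` is a hypothetical list of (graph, protocol_label) pairs."""
    best_graph, best_protocol = max(library, key=lambda e: similarity(target, e[0]))
    return best_protocol


# Usage: a 5-process ring is matched against a known ring and a known star.
ring = comm_graph([(0, 1), (1, 2), (2, 3), (3, 0)])
star = comm_graph([(0, 1), (0, 2), (0, 3)])
library = [(ring, "protocol-A"), (star, "protocol-B")]  # hypothetical labels
target = comm_graph([(i, (i + 1) % 5) for i in range(5)] * 3)
chosen = select_protocol(target, library)
```

A degree signature is far weaker than a structural graph comparison (two different topologies can share one signature), so it only shows where a similarity measure plugs into the selection step.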