Necessary and Sufficient Conditions for Consistent Global Snapshots
IEEE Transactions on Parallel and Distributed Systems
Time, clocks, and the ordering of events in a distributed system
Communications of the ACM
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Finding a suitable checkpoint and recovery protocol for a distributed application
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
A distributed system that employs checkpointing and rollback-recovery as its fault tolerance mechanism incurs overhead from the technique. The authors of [4] propose a technique that automatically identifies a checkpoint and recovery protocol from a pre-estimated database of overhead measures. The technique depends on computing the similarity between a pair of communication patterns. The computation first partitions both communication patterns into small pieces, or splices. A pair of splices, one taken from each of the two communication patterns in question, is then compared to compute a similarity measure. Splicing a communication pattern is an important step because it strongly influences the later steps of the computation. This paper introduces a new method for splicing. Experimental results show that the new method yields better similarity measure values than those reported in [4].
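The splice-then-compare pipeline described above can be illustrated with a minimal sketch. It is not the splicing method of this paper or of [4], neither of which is specified in this abstract. It assumes, purely for illustration, that a communication pattern is a list of (sender, receiver) pairs, that splices are fixed-size windows, and that two splices are compared by Jaccard similarity. The names `splice`, `splice_similarity`, and `pattern_similarity` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical representation: one message is a (sender, receiver) pair.
Message = Tuple[int, int]


def splice(pattern: List[Message], size: int) -> List[List[Message]]:
    """Partition a communication pattern into consecutive fixed-size splices.

    Fixed-size windows are an assumption made for this sketch only.
    """
    if size <= 0:
        raise ValueError("splice size must be positive")
    return [pattern[i:i + size] for i in range(0, len(pattern), size)]


def splice_similarity(a: List[Message], b: List[Message]) -> float:
    """Jaccard similarity of the message sets of two splices (assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty splices are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def pattern_similarity(p: List[Message], q: List[Message], size: int) -> float:
    """Average the similarity of position-aligned splice pairs.

    Splices without a counterpart in the other pattern contribute 0.
    """
    splices_p, splices_q = splice(p, size), splice(q, size)
    n = max(len(splices_p), len(splices_q))
    if n == 0:
        return 1.0  # two empty patterns are treated as identical
    total = sum(splice_similarity(x, y) for x, y in zip(splices_p, splices_q))
    return total / n


if __name__ == "__main__":
    p = [(0, 1), (1, 2), (2, 0), (0, 1)]
    q = [(0, 1), (1, 2), (2, 1), (1, 0)]
    print(pattern_similarity(p, q, size=2))
```

In this sketch the choice of splice boundaries directly changes which messages are compared with which, which is one way to see why the splicing step influences the resulting similarity values.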