Novel Crash Recovery Approach for Concurrent Failures in Cluster Federation

Authors:
Bidyut Gupta;Shahram Rahimi
Affiliations:
Department of Computer Science, Southern Illinois University, Carbondale, IL 62901, USA;Department of Computer Science, Southern Illinois University, Carbondale, IL 62901, USA
Venue:
GPC '09 Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing
Year:
2009

Abstract

In this paper, we propose a simple and efficient approach for checkpointing and recovery in a cluster computing environment. The recovery scheme handles both orphan and lost messages, whether intra-cluster or inter-cluster. The checkpointing scheme ensures that, after the system recovers from failures, all processes in the different clusters can restart from their respective most recent checkpoints, thus avoiding any domino effect. That is, the most recent checkpoints always form a consistent recovery line of the cluster federation. The main features of our work are: (1) it uses selective message logging, which enables the initiator process in each cluster to log the minimum number of messages; (2) the recovery scheme is domino-effect free and is executed simultaneously by all clusters in the cluster federation; (3) it considers concurrent failures; and (4) the message complexity in each cluster for both the checkpointing and recovery schemes is just O(n), where n is the number of processes in a cluster. These features make our algorithm superior to existing works.
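
To make the terms orphan message, lost message, and consistent recovery line concrete, the following is a minimal Python sketch based on the standard textbook definitions. It is not the algorithm proposed in the paper. The `Message` type, the `classify` and `is_consistent` functions, and the use of per-process local event indices are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Hypothetical message record (not from the paper)."""
    sender: str
    send_time: int   # local event index at the sender when the message was sent
    receiver: str
    recv_time: int   # local event index at the receiver when it was received


def classify(messages, checkpoints):
    """Split messages into (orphans, lost) relative to a set of checkpoints.

    `checkpoints` maps each process name to the local event index of its
    most recent checkpoint; events at or before that index are recorded.
    """
    orphans, lost = [], []
    for m in messages:
        sent_recorded = m.send_time <= checkpoints[m.sender]
        recv_recorded = m.recv_time <= checkpoints[m.receiver]
        if recv_recorded and not sent_recorded:
            # Receive is recorded but the send is not: orphan message.
            orphans.append(m)
        elif sent_recorded and not recv_recorded:
            # Send is recorded but the receive is not: lost message.
            lost.append(m)
    return orphans, lost


def is_consistent(messages, checkpoints):
    """A set of checkpoints is a consistent recovery line if it has no orphans."""
    orphans, _ = classify(messages, checkpoints)
    return not orphans


msgs = [
    Message("p1", 5, "p2", 3),  # sent after p1's checkpoint, received before p2's
    Message("p1", 2, "p2", 6),  # sent before p1's checkpoint, received after p2's
    Message("p1", 1, "p2", 2),  # send and receive both recorded
]
cp = {"p1": 4, "p2": 4}
orphans, lost = classify(msgs, cp)
```

In this sketch a lost message does not make the checkpoints inconsistent, but it has to be replayed from a log after rollback, which is the role message logging plays in recovery schemes of this kind.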