Domino-effect free crash recovery for concurrent failures in cluster federation

Authors:
Bidyut Gupta;Shahram Rahimi;Vineel Allam;Vamshi Jupally
Affiliations:
Computer Science Department, Southern Illinois University, Carbondale, IL;Computer Science Department, Southern Illinois University, Carbondale, IL;Computer Science Department, Southern Illinois University, Carbondale, IL;Computer Science Department, Southern Illinois University, Carbondale, IL
Venue:
GPC'08 Proceedings of the 3rd international conference on Advances in grid and pervasive computing
Year:
2008

Abstract

In this paper, we address the complex problem of recovery from concurrent failures in a cluster computing environment. We propose a new approach that, unlike existing works, handles both inter-cluster orphan messages and inter-cluster lost messages. The proposed recovery approach is free from the domino effect and hence guarantees the least amount of recomputation after recovery. Moreover, a process needs to save only its most recent local checkpoint, and the same holds for a cluster, so the number of trips to stable storage per process during recovery is always one. The proposed common checkpointing interval is chosen so that a process logs the minimum number of messages it has sent. These features make our approach superior to existing works.
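
The distinction between orphan and lost messages mentioned in the abstract can be illustrated with a minimal Python sketch. It uses the standard textbook definitions and is not the paper's recovery algorithm: a message is treated as an orphan if its receipt is recorded in the receiver's checkpoint but its send is not recorded in the sender's, and as lost if the send is recorded but the receipt is not. The names `Message`, `classify`, and `checkpoint`, and the use of per-process event indices, are assumptions made only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """A message tagged with hypothetical per-process event indices."""
    sender: str
    send_idx: int      # position of the send event in the sender's history
    receiver: str
    recv_idx: int      # position of the receive event in the receiver's history


def classify(msg, checkpoint):
    """Classify a message relative to each process's latest checkpoint.

    `checkpoint` maps a process name to the index of the last event that
    its most recent local checkpoint covers (an assumption of this sketch).
    """
    send_saved = msg.send_idx <= checkpoint[msg.sender]
    recv_saved = msg.recv_idx <= checkpoint[msg.receiver]
    if recv_saved and not send_saved:
        return "orphan"    # receipt survives rollback, send does not
    if send_saved and not recv_saved:
        return "lost"      # send survives rollback, receipt does not
    return "recorded" if send_saved else "undone"


# Latest checkpoints: P1 covers events 1..5, P2 covers events 1..4.
checkpoint = {"P1": 5, "P2": 4}

orphan_msg = Message("P1", 6, "P2", 3)   # sent after P1's checkpoint, received before P2's
lost_msg = Message("P1", 2, "P2", 7)     # sent before P1's checkpoint, received after P2's
safe_msg = Message("P2", 1, "P1", 4)     # both events covered by the checkpoints

for m in (orphan_msg, lost_msg, safe_msg):
    print(m.sender, "->", m.receiver, classify(m, checkpoint))
```

In this sketch, a recovery scheme would have to avoid restarting from a state containing an orphan message, and would have to replay a lost message from the sender's log, which is why logging sent messages matters.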