Novel Crash Recovery Approach for Concurrent Failures in Cluster Federation
GPC '09 Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing
In this paper, we address the problem of recovery from concurrent failures in a cluster computing environment. Unlike existing work, the proposed approach handles both inter-cluster orphan messages and inter-cluster lost messages. The recovery approach is free from the domino effect and therefore guarantees the least amount of recomputation after recovery. In addition, each process needs to save only its most recent local checkpoint, and the same holds for each cluster. Consequently, each process makes exactly one trip to stable storage during recovery. The proposed common checkpointing interval is chosen so that a process logs the minimum number of the messages it has sent. These features make our approach superior to existing work.
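To make the two message categories named above concrete, the following minimal sketch classifies a message against a set of latest checkpoints using the standard definitions: a message is an orphan if its receipt is recorded in the receiver's checkpoint but its send is not recorded in the sender's, and it is lost if the send is recorded but the receipt is not. It is an illustration only, not the recovery algorithm proposed in the paper. The names Message, classify, and the per-process event counters are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    # Hypothetical record of one message; all fields are illustrative.
    sender: str
    receiver: str
    send_event: int  # sender's local event counter at the send
    recv_event: int  # receiver's local event counter at the receive


def classify(msg, checkpoint):
    """Classify msg relative to each process's latest checkpoint.

    checkpoint maps a process name to the value of its local event
    counter at the time the checkpoint was taken (an assumption of
    this sketch).
    """
    send_recorded = msg.send_event <= checkpoint[msg.sender]
    recv_recorded = msg.recv_event <= checkpoint[msg.receiver]
    if recv_recorded and not send_recorded:
        return "orphan"  # received according to the checkpoints, never sent
    if send_recorded and not recv_recorded:
        return "lost"    # sent according to the checkpoints, never received
    return "consistent"  # both events recorded, or neither


checkpoint = {"P": 5, "Q": 5}
orphan = classify(Message("P", "Q", send_event=6, recv_event=4), checkpoint)
lost = classify(Message("P", "Q", send_event=3, recv_event=7), checkpoint)
ok = classify(Message("P", "Q", send_event=2, recv_event=4), checkpoint)
```

In this sketch a lost message is the kind a sender-side log would let the sender replay after rollback, which is why the number of sent messages a process must log matters to the cost of recovery.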