Checkpointing and Recovery of Shared Memory Parallel Applications in a Cluster
CCGRID '03 Proceedings of the 3rd International Symposium on Cluster Computing and the Grid
Future Generation Computer Systems - Special issue: Advanced services for clusters and internet computing
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Cluster systems are becoming more prevalent, and users are beginning to demand that these systems be reliable. Currently, most clusters are designed to provide high performance at the cost of little to no reliability. To address this, this report examines how a recovery facility, based on either a centralised or a distributed approach, could be implemented in a cluster that is supported by a checkpointing facility. Such a recovery facility can then recover failed user processes by using checkpoints of those processes taken during failure-free execution.