Recent research on parallel processing on non-dedicated clusters has focused on high execution performance, parallelism management, transparent access to resources, and ease of use. However, because a cluster is a collection of independent computers shared by multiple users, it is susceptible to failure. This paper presents the development of a coordinated checkpointing facility for the GENESIS cluster operating system. The facility was built by exploiting existing operating system services. It achieves high performance and low overhead by allowing the processes of a parallel application to continue executing while checkpoints are created, and it keeps demands on cluster resources low by using coordinated checkpointing.