A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
UNIX Network Programming, Vol. 1
Extending a Cluster SSI OS for Transparently Checkpointing Message-Passing Parallel Application
ISPAN '05 Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks
DMTCP: Transparent checkpointing for cluster computations and the desktop
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
The Architecture of the XtreemOS Grid Checkpointing Service
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Independent checkpointing in a heterogeneous grid environment
Future Generation Computer Systems
A grid checkpointing service that provides migration and transparent fault tolerance is important for distributed and parallel applications executed in heterogeneous grids. In this paper we address the challenges of checkpointing and migrating the communication channels of grid applications executed on nodes equipped with different checkpointer packages. We present a solution that is transparent to both the applications and the underlying checkpointers. It also allows single-node checkpointers to be used for distributed applications. Measurements show only a small overhead, especially for large grid applications, where checkpointing may take many minutes.