The EU-funded XtreemOS project implements a grid operating system (OS) that transparently exploits distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) to implement migration and fault tolerance for grid applications. Checkpointing and restarting applications in a grid requires saving and restoring distributed and parallel applications in distributed, heterogeneous environments. In this paper we present the architecture of the XtreemGCP service, which integrates existing system-specific checkpointer solutions. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows different checkpointers to be used in a uniform way. Our architecture is open to different checkpointing strategies, which can be adapted to evolving failure situations or changing application requirements. Finally, we discuss measurements showing that the XtreemGCP architecture introduces only minimal overhead.
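The idea of a common checkpointer API can be illustrated with a minimal Python sketch. It is not the XtreemGCP interface: the class names (`KernelCheckpointer`, `InMemoryCheckpointer`, `JobCheckpointer`), the method signatures, and the node and backend names are all hypothetical, and the backends only simulate saving state in memory. The sketch shows only how a grid-level service could drive different node-local checkpointers through one interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


class KernelCheckpointer(ABC):
    """Hypothetical uniform interface; not the actual XtreemGCP API."""

    name: str

    @abstractmethod
    def checkpoint(self, pid: int) -> str:
        """Save the state of process `pid` and return an image identifier."""

    @abstractmethod
    def restart(self, image: str) -> int:
        """Restore a process from `image` and return its process id."""


class InMemoryCheckpointer(KernelCheckpointer):
    """Stand-in for a system-specific checkpointer (simulation only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._images: dict[str, int] = {}

    def checkpoint(self, pid: int) -> str:
        # A real backend would dump process state; here we only record the pid.
        image = f"{self.name}-{pid}-{len(self._images)}"
        self._images[image] = pid
        return image

    def restart(self, image: str) -> int:
        # A real backend would rebuild the process; here we return the saved pid.
        return self._images[image]


@dataclass
class JobCheckpointer:
    """Grid-level service that maps each node to its local checkpointer."""

    backends: dict[str, KernelCheckpointer] = field(default_factory=dict)

    def register(self, node: str, backend: KernelCheckpointer) -> None:
        self.backends[node] = backend

    def checkpoint_job(self, units: dict[str, int]) -> dict[str, str]:
        # `units` maps node name -> pid of the job unit running on that node.
        return {node: self.backends[node].checkpoint(pid)
                for node, pid in units.items()}

    def restart_job(self, images: dict[str, str]) -> dict[str, int]:
        # `images` maps node name -> image identifier from checkpoint_job.
        return {node: self.backends[node].restart(image)
                for node, image in images.items()}


service = JobCheckpointer()
service.register("node-a", InMemoryCheckpointer("backend_a"))
service.register("node-b", InMemoryCheckpointer("backend_b"))
images = service.checkpoint_job({"node-a": 101, "node-b": 202})
restored = service.restart_job(images)
```

Because the service depends only on the abstract interface, a further backend can be added by registering it, without changing the job-level code. This is the kind of decoupling the abstract attributes to the common kernel checkpointer API.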