Future Generation Computer Systems
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ) transparently through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The proposed solution is not bound to a particular checkpointer and can therefore be used transparently on top of any checkpointer package. To evaluate the prototype, we run it in a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate that the XtreemGCP service can integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution scales better than the existing coordinated checkpointing protocol.
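
The idea of a uniform checkpointer interface combined with independent checkpointing can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Checkpointer`, `PickleCheckpointer`, `JsonCheckpointer`, and `Node` are hypothetical and are not part of the XtreemGCP API, and the two toy backends merely stand in for real packages such as BLCR or OpenVZ. Each node saves its own state whenever it chooses, without synchronizing with other nodes. The sketch deliberately omits the dependency tracking or message logging that an independent checkpointing protocol needs to find a consistent recovery line.

```python
from abc import ABC, abstractmethod
import json
import pickle


class Checkpointer(ABC):
    """Hypothetical uniform interface; NOT the actual XtreemGCP API."""

    @abstractmethod
    def checkpoint(self, state: dict) -> bytes:
        """Serialize the given state into a checkpoint image."""

    @abstractmethod
    def restart(self, image: bytes) -> dict:
        """Rebuild the state from a checkpoint image."""


class PickleCheckpointer(Checkpointer):
    """Toy backend (assumption) standing in for one checkpointer package."""

    def checkpoint(self, state: dict) -> bytes:
        return pickle.dumps(state)

    def restart(self, image: bytes) -> dict:
        return pickle.loads(image)


class JsonCheckpointer(Checkpointer):
    """Second toy backend (assumption) with a different image format."""

    def checkpoint(self, state: dict) -> bytes:
        return json.dumps(state).encode("utf-8")

    def restart(self, image: bytes) -> dict:
        return json.loads(image.decode("utf-8"))


class Node:
    """A process that checkpoints independently of all other nodes."""

    def __init__(self, name: str, checkpointer: Checkpointer):
        self.name = name
        self.checkpointer = checkpointer  # any backend behind the interface
        self.state = {"step": 0}
        self.images = []  # locally stored checkpoint images

    def work(self) -> None:
        self.state["step"] += 1

    def take_checkpoint(self) -> None:
        # Independent checkpointing: no coordination message is exchanged.
        self.images.append(self.checkpointer.checkpoint(self.state))

    def recover(self) -> None:
        # Backward error recovery: roll back to the latest local image.
        self.state = self.checkpointer.restart(self.images[-1])


if __name__ == "__main__":
    a = Node("a", PickleCheckpointer())
    b = Node("b", JsonCheckpointer())
    a.work(); a.take_checkpoint(); a.work()          # a checkpoints at step 1
    b.work(); b.work(); b.take_checkpoint(); b.work()  # b checkpoints at step 2
    a.recover(); b.recover()
    print(a.state, b.state)
```

Because `Node` depends only on the abstract interface, swapping the backend does not change the protocol code, which is the property the abstract describes as not being bound to a particular checkpointer.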