Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
Component software: beyond object-oriented programming
Component software: beyond object-oriented programming
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
A Resource Management Architecture for Metacomputing Systems
IPPS/SPDP '98 Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
Toward a Common Component Architecture for High-Performance Scientific Computing
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
System-Level Versus User-Defined Checkpointing
SRDS '98 Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems
Supporting dynamic migration in tightly coupled grid applications
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Performance evaluation of an application-level checkpointing solution on grids
Future Generation Computer Systems
A technique for non-invasive application-level checkpointing
The Journal of Supercomputing
Application-Level checkpointing techniques for parallel programs
ICDCIT'06 Proceedings of the Third international conference on Distributed Computing and Internet Technology
Survey: Survey of fault tolerant techniques for grid
Computer Science Review
Hi-index | 0.00 |
With the advent of Grid computing, more and more high-end computational resources become available for use to a scientist. While this opens up new avenues for scientific research, it makes reliability and fault tolerance of such a system a non-trivial task, especially for long running distributed applications. In order to solve this problem, we present a distributed user-defined checkpointing mechanism within the XCAT3 system. XCAT3 is a framework for Component Component Architecture (CCA) based components consistent with current Grid standards. We describe in detail the algorithms and APIs that are added to XCAT3 in order to support distributed checkpointing. Our approach ensures that the checkpoints are platform independent, minimal in size, and always available during component failures. In addition, our algorithms maintain correctness in the presence of failures and scale well with the number of components, and checkpoint size.