Checkpoint and Restart for Distributed Components in XCAT3

Authors:
Sriram Krishnan;Dennis Gannon
Affiliations:
Indiana University, Bloomington, IN;Indiana University, Bloomington, IN
Venue:
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Year:
2004

Abstract

With the advent of Grid computing, an increasing number of high-end computational resources are becoming available to scientists. While this opens up new avenues for scientific research, it makes ensuring the reliability and fault tolerance of such systems a non-trivial task, especially for long-running distributed applications. To address this problem, we present a distributed user-defined checkpointing mechanism within the XCAT3 system. XCAT3 is a framework for Common Component Architecture (CCA) based components that is consistent with current Grid standards. We describe in detail the algorithms and APIs added to XCAT3 to support distributed checkpointing. Our approach ensures that the checkpoints are platform independent, minimal in size, and always available during component failures. In addition, our algorithms maintain correctness in the presence of failures and scale well with both the number of components and the checkpoint size.