While a variety of checkpointing techniques and systems have been documented for long-running programs, they are typically not accessible to programmers who are not systems experts. This paper describes a project that combines three technologies, NetSolve, Starfish, and IBP, to integrate fault tolerance seamlessly into long-running applications. We discuss the design and implementation of this project and present performance results from executions on both local and wide-area networks.