This paper reports on the architecture and design of Starfish, an environment for executing dynamic (and static) MPI-2 programs on a cluster of workstations. Starfish is unique in being efficient, fault-tolerant, highly available, and dynamic as a system internally, and in supporting fault tolerance and dynamicity for its application programs as well. Starfish achieves these goals by combining group communication technology with checkpoint/restart. Its novel architecture is both flexible and portable, and it keeps group communication outside the critical data path for maximum performance.