Initial versions of MPI were designed to work efficiently on multiprocessors that had very little job control and thus static process models. Subsequently forcing them to support dynamic process operations would have affected their performance. As current HPC systems grow in size, with a correspondingly higher potential for individual node failures, the need for new fault-tolerant systems rises. Here we present a new implementation of MPI called FT-MPI that allows the semantics and associated failure modes to be completely controlled by the application. We give an overview of the FT-MPI semantics, design, and some performance issues, as well as the HARNESS g_hcore implementation it is built upon.