Manetho: Transparent Roll Back-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Network performance-aware collective communication for clustered wide-area systems
Parallel Computing - Clusters and computational grids for scientific computing
Proceedings of the 2001 ACM/IEEE conference on Supercomputing
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
The Globus Project: A Status Report
HCW '98 Proceedings of the Seventh Heterogeneous Computing Workshop
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
MPICH-G2: a Grid-enabled implementation of the Message Passing Interface
Journal of Parallel and Distributed Computing - Special issue on computational grids
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Scientific Programming
A Web Services Gateway for the H2O Lightweight Grid Computing Framework
ServiceWave '08 Proceedings of the 1st European Conference on Towards a Service-Based Internet
FT-MPI, fault-tolerant metacomputing and generic name services: a case study
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Applicability of generic naming services and fault-tolerant metacomputing with FT-MPI
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Towards cross-platform cloud computing
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
Hi-index | 0.00 |
We observe increasing interest in aggregating geographically distributed, heterogeneous resources to perform large scale computations. MPI remains the most popular programming paradigm for such applications; however, as the size of computing environments increases, fault tolerance aspects become critically important. We argue that the fault tolerance model proposed by FT-MPI fits well in geographically distributed environments, even though its current implementation is confined to a single administrative domain. We propose to overcome these limitations by combining FTMPI with the H2O resource sharing framework. Our approach allows users to run fault tolerant MPI programs on heterogeneous, geographically distributed shared machines, without sacrificing performance and with minimal involvement of resource providers.