PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ checkpoint-restart and/or process redundancy to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls, pioneered by the Linda system, is a good match because processes can execute their communication operations independently and asynchronously. However, Linda and its many variants are not designed for communicating processes that are replicated or independently restarted from checkpoints. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times, leading to semantic inconsistency. This paper presents the design, execution model, implementation, and validation of a communication layer for robust execution on volatile nodes. The research leads to a practical way to employ idle PCs for latency-tolerant parallel computing applications.
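The consistency problem described above can be illustrated with a small sketch. It is not the paper's implementation: the class `ToyDataspace`, its method signatures, and the idea of tagging each call with a `(process_id, op_number)` pair are assumptions made purely for illustration. The sketch shows one plausible way a store with one-sided put/get calls could give every instance of the same logical operation the same result, whether that instance is a replica or a process restarted from a checkpoint.

```python
class ToyDataspace:
    """Hypothetical in-memory store with one-sided put/get calls.

    Every call carries (process_id, op_number), an assumed tag that
    identifies one logical operation of one logical process.
    """

    def __init__(self):
        self._store = {}        # key -> current value
        self._puts_seen = set() # logical puts that were already applied
        self._get_log = {}      # logical get -> value first returned for it

    def put(self, process_id, op_number, key, value):
        op = (process_id, op_number)
        if op in self._puts_seen:
            return  # repeated by a replica or a restarted instance: ignore
        self._puts_seen.add(op)
        self._store[key] = value

    def get(self, process_id, op_number, key):
        op = (process_id, op_number)
        # The first instance to run this logical get fixes its result;
        # later instances receive the logged value, not the current one.
        if op not in self._get_log:
            self._get_log[op] = self._store.get(key)
        return self._get_log[op]


ds = ToyDataspace()
ds.put(1, 1, "x", 1)               # process 1, operation 1 writes x = 1
first = ds.get(0, 1, "x")          # first instance of process 0 reads x
ds.put(1, 2, "x", 2)               # process 1 moves on and overwrites x
late = ds.get(0, 1, "x")           # a slower replica repeats the same get
```

Here `first` and `late` are both 1, even though `x` holds 2 by the time the slower replica runs. A store without the log would hand the two instances different values for the same logical operation, which is the semantic inconsistency the abstract refers to.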