Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times in the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over Blue Gene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.