Scalable, fault tolerant membership for MPI tasks on HPC systems

  • Authors:
  • Jyothish Varma, North Carolina State University, Raleigh, NC
  • Chao Wang, North Carolina State University, Raleigh, NC
  • Frank Mueller, North Carolina State University, Raleigh, NC
  • Christian Engelmann, Oak Ridge National Laboratory, Oak Ridge, TN
  • Stephen L. Scott, Oak Ridge National Laboratory, Oak Ridge, TN

  • Venue:
  • Proceedings of the 20th annual international conference on Supercomputing
  • Year:
  • 2006

Abstract

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes and ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and low reconfiguration overhead within the fault-tolerant layer.

This paper contributes a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows reconfiguration response times on the order of hundreds of microseconds with MPI over Blue Gene/L and single-digit milliseconds with TCP over Gigabit Ethernet. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems, and we have integrated an early version into LAM/MPI.
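The core idea of the abstract, that every surviving peer converges on the same view of active nodes after failures, can be illustrated with a minimal sketch. This is not the paper's protocol (which runs as a decentralized peer-to-peer service inside the MPI runtime); the `MembershipView` class, its methods, and the ring-successor routing below are illustrative assumptions.

```python
# Illustrative sketch only: a per-node membership view that converges to the
# same set of active ranks no matter the order in which failure notifications
# arrive. Class and method names are assumptions, not the paper's API.

class MembershipView:
    """Each node keeps an ordered view of the ranks believed to be alive."""

    def __init__(self, ranks):
        self.active = sorted(set(ranks))

    def fail(self, rank):
        # Applying the same set of failure notifications, in any order,
        # yields the same final view on every peer (removal is commutative).
        if rank in self.active:
            self.active.remove(rank)

    def successor(self, rank):
        # Next live rank in ring order; lets communication route around
        # failed nodes once the view has been reconfigured.
        live = [r for r in self.active if r > rank]
        return live[0] if live else self.active[0]


# Example: an 8-node job loses ranks 3 and 5.
view = MembershipView(range(8))
for failed in (3, 5):
    view.fail(failed)
print(view.active)        # [0, 1, 2, 4, 6, 7]
print(view.successor(2))  # 4
print(view.successor(7))  # 0 (ring wraps around)
```

In the paper's setting this bookkeeping must be done without a central coordinator and with low latency, which is where the decentralized protocol and its topology-aware adaptation come in.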