Design and implementation of a scalable membership service for supercomputer resiliency-aware runtime

Authors:
Yoav Tock; Benjamin Mandler; José Moreira; Terry Jones
Affiliations:
IBM Haifa Research Laboratory, Haifa, Israel; IBM Haifa Research Laboratory, Haifa, Israel; IBM T.J. Watson Research Center, Yorktown Heights, NY; Oak Ridge National Laboratory, Oak Ridge, TN
Venue:
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
Year:
2013

Abstract

As HPC systems and applications grow larger and more complex, we are approaching an era in which resiliency and run-time elasticity become paramount concerns. We offer a building block for an alternative resiliency approach in which computations can make progress while components fail, and in which the set of nodes can change dynamically throughout the lifetime of a computation. The core of our solution is a hierarchical, scalable membership service that provides eventual consistency semantics. An attribute replication service is used to organize the hierarchy and is also exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general purpose, yet exploits the unique features and architecture of HPC platforms. We have implemented and tested this system on BlueGene/P with Linux and, using worst-case analysis, evaluated the service's scalability as effective for up to 1M nodes.
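
To make "eventual consistency semantics" for replicated attributes concrete, the following is a minimal sketch of a flat, gossip-style membership view that converges through repeated pairwise exchanges. It is an illustration only and not the implementation described in this paper: the hierarchy, failure detection, and BlueGene/P specifics are omitted, and the names `Node`, `set_attribute`, `merge`, `gossip_round`, and `run_until_converged` are hypothetical. It also assumes every node can pick peers from a complete bootstrap list and that a higher version number always wins.

```python
import random


class Node:
    """One member holding a versioned view of every member it has heard of."""

    def __init__(self, node_id):
        self.node_id = node_id
        # view maps member id -> (version, attributes); higher version wins.
        self.view = {node_id: (1, {"alive": True})}

    def set_attribute(self, key, value):
        """Update one of this node's own attributes and bump its version."""
        version, attrs = self.view[self.node_id]
        self.view[self.node_id] = (version + 1, {**attrs, key: value})

    def merge(self, remote_view):
        """Keep, for each member, the entry with the highest version."""
        for member, (version, attrs) in remote_view.items():
            if member not in self.view or version > self.view[member][0]:
                self.view[member] = (version, dict(attrs))


def gossip_round(nodes, rng):
    """Each node does a push-pull exchange with one randomly chosen peer."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        peer.merge(node.view)
        node.merge(peer.view)


def run_until_converged(nodes, rng, max_rounds=100):
    """Gossip until all views are equal; return the rounds used, or None."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n.view == nodes[0].view for n in nodes):
            return round_no
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = [Node(i) for i in range(16)]
    print("rounds to learn full membership:", run_until_converged(nodes, rng))
    nodes[3].set_attribute("rank", 7)
    print("rounds to spread one update:", run_until_converged(nodes, rng))
```

In this sketch no node ever blocks waiting for agreement: an attribute update is applied locally at once and reaches the other views over later rounds, which is the trade-off that eventual consistency makes in exchange for scalability.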