As HPC systems and applications grow larger and more complex, we are approaching an era in which resiliency and run-time elasticity become paramount concerns. We offer a building block for an alternative approach to resiliency, in which a computation can continue to make progress while components fail and the set of participating nodes can change dynamically throughout the computation's lifetime. The core of our solution is a scalable, hierarchical membership service that provides eventual consistency semantics. An attribute replication service is used to organize the hierarchy and is also exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general purpose, yet it exploits features and architecture unique to HPC platforms. We have implemented and tested the system on BlueGene/P running Linux and, using worst-case analysis, evaluated the service's scalability as effective for up to 1M nodes.