Owing to the high rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive, and self-healing. Most HPC services are still designed around a centralized paradigm and are therefore susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but they are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads in support of extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we show the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and by presenting a KVS-based distributed job launch prototype.