Using simulation to explore distributed key-value stores for extreme-scale system services

Authors:
Ke Wang; Abhishek Kulkarni; Michael Lang; Dorian Arnold; Ioan Raicu
Affiliations:
Illinois Institute of Technology, Los Alamos National Laboratory; Indiana University; Los Alamos National Laboratory; University of New Mexico; Illinois Institute of Technology, Argonne National Laboratory
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013



Abstract

Owing to the significantly high rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive, and self-healing. A majority of HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads to support extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and presenting a KVS-based distributed job launch prototype.