Using one-sided RDMA reads to build a fast, CPU-efficient key-value store

Authors:
Christopher Mitchell; Yifeng Geng; Jinyang Li
Affiliations:
New York University; Tsinghua University; New York University
Venue:
USENIX ATC '13: Proceedings of the 2013 USENIX Annual Technical Conference
Year:
2013

Abstract

Recent technological trends indicate that future datacenter networks will incorporate High Performance Computing network features, such as ultra-low latency and CPU bypassing. How can these features be exploited in datacenter-scale systems infrastructure? In this paper, we explore the design of a distributed in-memory key-value store called Pilaf that takes advantage of Remote Direct Memory Access to achieve high performance with low CPU overhead. In Pilaf, clients directly read from the server's memory via RDMA to perform gets, which commonly dominate key-value store workloads. By contrast, put operations are serviced by the server to simplify the task of synchronizing memory accesses. To detect inconsistent RDMA reads with concurrent CPU memory modifications, we introduce the notion of self-verifying data structures that can detect read-write races without client-server coordination. Our experiments show that Pilaf achieves low latency and high throughput while consuming few CPU resources. Specifically, Pilaf can surpass 1.3 million ops/sec (90% gets) using a single CPU core compared with 55K for Memcached and 59K for Redis.
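The idea of a self-verifying data structure can be illustrated with a minimal sketch. It assumes that each entry carries a checksum over its own contents, so a client that reads the raw bytes can tell on its own whether it raced with a concurrent write, and can retry if so. The entry layout, the use of CRC32, and the names `encode_entry`, `decode_entry`, `get`, and `read_remote` are hypothetical choices for this example and are not taken from the abstract; `read_remote` stands in for a one-sided RDMA read.

```python
import struct
import zlib

# Assumed entry layout: key length, value length, CRC32 of key+value.
HEADER = struct.Struct("<HHI")


def encode_entry(key: bytes, value: bytes) -> bytes:
    """Server side: serialize an entry together with its checksum."""
    payload = key + value
    return HEADER.pack(len(key), len(value), zlib.crc32(payload)) + payload


def decode_entry(raw: bytes):
    """Client side: return (key, value), or None if verification fails."""
    if len(raw) < HEADER.size:
        return None
    key_len, val_len, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:HEADER.size + key_len + val_len]
    if len(payload) != key_len + val_len or zlib.crc32(payload) != crc:
        return None  # bytes are inconsistent, e.g. read during a write
    return payload[:key_len], payload[key_len:]


def get(read_remote, max_retries: int = 8):
    """Re-read until an entry verifies; no server coordination is needed."""
    for _ in range(max_retries):
        entry = decode_entry(read_remote())
        if entry is not None:
            return entry
    raise RuntimeError("entry did not verify after retries")
```

In this sketch, a read that mixes bytes from an old and a new version of an entry fails the checksum and is simply retried, which mirrors the abstract's point that races are detected without client-server coordination.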