Thin servers with smart pipes: designing SoC accelerators for memcached

Authors:
Kevin Lim;David Meisner;Ali G. Saidi;Parthasarathy Ranganathan;Thomas F. Wenisch
Affiliations:
HP Labs;Facebook;ARM R&D;HP Labs;EECS, Univ. of Michigan
Venue:
Proceedings of the 40th Annual International Symposium on Computer Architecture
Year:
2013

Abstract

Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel: long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture, Thin Servers with Smart Pipes (TSSP), for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data lookup. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X to 16X power-performance improvement over conventional server baselines.
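
The split described in the abstract, with GET requests served by the accelerator and everything else left to the embedded core, can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the TSSP design itself: the class `ThinServerModel` and its method names are hypothetical, a Python dictionary stands in for the in-memory store, and only the `get` and `set` commands of the memcached text protocol are modeled (expiration times are ignored).

```python
# Illustrative software model only. All names here are hypothetical and are
# not taken from the TSSP hardware design described in the abstract.

class ThinServerModel:
    def __init__(self):
        self.table = {}           # stands in for the key-value data in memory
        self.fast_path_count = 0  # requests routed to the modeled accelerator
        self.slow_path_count = 0  # requests routed to the modeled low-power core

    def handle(self, request: bytes) -> bytes:
        """Route one text-protocol request to the fast or slow path."""
        line, _, rest = request.partition(b"\r\n")
        parts = line.split()
        # Single-key GETs take the fast path; everything else falls through.
        if len(parts) == 2 and parts[0] == b"get":
            self.fast_path_count += 1
            return self._accelerator_get(parts[1])
        self.slow_path_count += 1
        return self._core_handle(parts, rest)

    def _accelerator_get(self, key: bytes) -> bytes:
        # Models the lookup that the abstract says is done entirely in hardware.
        entry = self.table.get(key)
        if entry is None:
            return b"END\r\n"  # miss: the reply carries no VALUE block
        flags, value = entry
        return b"VALUE %s %d %d\r\n%s\r\nEND\r\n" % (key, flags, len(value), value)

    def _core_handle(self, parts, rest: bytes) -> bytes:
        # Models the general-purpose core handling non-GET commands.
        if len(parts) == 5 and parts[0] == b"set":
            key, flags, nbytes = parts[1], int(parts[2]), int(parts[4])
            self.table[key] = (flags, rest[:nbytes])  # exptime (parts[3]) ignored
            return b"STORED\r\n"
        return b"ERROR\r\n"


if __name__ == "__main__":
    server = ThinServerModel()
    print(server.handle(b"set k 0 0 5\r\nhello\r\n"))
    print(server.handle(b"get k\r\n"))
    print(server.fast_path_count, server.slow_path_count)
```

In this model the fast path touches only the lookup table and reply formatting, which mirrors the idea of keeping the common GET case off the long software code paths that the characterization identifies as the bottleneck.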