Application performance pitfalls and TCP's Nagle algorithm
ACM SIGMETRICS Performance Evaluation Review
Direct Cache Access for High Bandwidth Network I/O
Proceedings of the 32nd annual international symposium on Computer Architecture
Performance Analysis of System Overheads in TCP/IP Workloads
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Integrated network interfaces for high-bandwidth TCP/IP
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Dynamo: amazon's highly available key-value store
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
A measurement-driven analysis of information propagation in the flickr social network
Proceedings of the 18th international conference on World wide web
VL2: a scalable and flexible data center network
Proceedings of the ACM SIGCOMM 2009 conference on Data communication
FAWN: a fast array of wimpy nodes
Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
Performance Measurement of an Integrated NIC Architecture with 10GbE
HOTI '09 Proceedings of the 2009 17th IEEE Symposium on High Performance Interconnects
Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Web search using mobile cores: quantifying and mitigating the price of efficiency
Proceedings of the 37th annual international symposium on Computer architecture
Energy proportional datacenter networks
Proceedings of the 37th annual international symposium on Computer architecture
Hedera: dynamic flow scheduling for data center networks
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Finding a needle in Haystack: facebook's photo storage
OSDI'10 Proceedings of the 9th USENIX conference on Operating systems design and implementation
The power of one move: hashing schemes for hardware
IEEE/ACM Transactions on Networking (TON)
HotOS'13 Proceedings of the 13th USENIX conference on Hot topics in operating systems
Warehouse-Scale Computing: Entering the Teenage Decade
Proceedings of the 38th annual international symposium on Computer architecture
Memcached Design on High Performance RDMA Capable Interconnects
ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
CPHASH: a cache-partitioned hash table
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
IGCC '11 Proceedings of the 2011 International Green Computing Conference and Workshops
Workload analysis of a large-scale key-value store
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Characterizing and evaluating a key-value store application on heterogeneous CPU-GPU systems
ISPASS '12 Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software
Wimpy nodes with 10GbE: leveraging one-sided operations in soft-RDMA to boost memcached
USENIX ATC'12 Proceedings of the 2012 USENIX conference on Annual Technical Conference
Chronos: predictable low latency for data center applications
Proceedings of the Third ACM Symposium on Cloud Computing
Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays
nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Integrated 3D-stacked server designs for increasing physical density of key-value stores
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Rhythm: harnessing data parallel hardware for server workloads
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
MICA: a holistic approach to fast in-memory key-value storage
NSDI'14 Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation
Hi-index | 0.00 |
Distributed in-memory key-value stores, such as memcached, are central to the scalability of modern internet services. Current deployments use commodity servers with high-end processors. However, given the cost-sensitivity of internet services and the recent proliferation of volume low-power System-on-Chip (SoC) designs, we see an opportunity for alternative architectures. We undertake a detailed characterization of memcached to reveal performance and power inefficiencies. Our study considers both high-performance and low-power CPUs and NICs across a variety of carefully-designed benchmarks that exercise the range of memcached behavior. We discover that, regardless of CPU microarchitecture, memcached execution is remarkably inefficient, saturating neither network links nor available memory bandwidth. Instead, we find performance is typically limited by the per-packet processing overheads in the NIC and OS kernel---long code paths limit CPU performance due to poor branch predictability and instruction fetch bottlenecks. Our insights suggest that neither high-performance nor low-power cores provide a satisfactory power-performance trade-off, and point to a need for tighter integration of the network interface. Hence, we argue for an alternate architecture---Thin Servers with Smart Pipes (TSSP)---for cost-effective high-performance memcached deployment. TSSP couples an embedded-class low-power core to a memcached accelerator that can process GET requests entirely in hardware, offloading both network handling and data look up. We demonstrate the potential benefits of our TSSP architecture through an FPGA prototyping platform, and show the potential for a 6X-16X power-performance improvement over conventional server baselines.