Rhythm: harnessing data parallel hardware for server workloads

Authors:
Sandeep R. Agrawal;Valentin Pistol;Jun Pang;John Tran;David Tarjan;Alvin R. Lebeck
Affiliations:
Duke University, Durham, USA;Duke University, Durham, USA;Duke University, Durham, USA;NVIDIA Corporation, Santa Clara, USA;NVIDIA Corporation, Santa Clara, USA;Duke University, Durham, USA
Venue:
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Year:
2014

Abstract

Growing web traffic demands higher server throughput without sacrificing energy efficiency or total cost of ownership. Current work on data center efficiency focuses primarily on the data center as a whole, using off-the-shelf hardware for individual servers. Server capacity is typically increased by adding more machines, which is cheap but inefficient in the long run in terms of energy and area. Our work builds on the observation that server workload execution patterns are not entirely unique across requests. We present Rhythm, a framework for high-throughput servers that exploits similarity across requests to improve server performance and power/energy efficiency by launching data-parallel executions for request cohorts. An implementation of the SPECWeb Banking workload using Rhythm on NVIDIA GPUs provides a basis for evaluating both software and hardware for future cohort-based servers. Our evaluation of Rhythm on future server platforms shows that it achieves 4x the throughput (reqs/sec) of a Core i7 at efficiencies (reqs/Joule) comparable to a dual-core ARM Cortex-A9. A Rhythm implementation that generates transposed responses achieves 8x the i7 throughput while processing 2.5x more requests/Joule than the A9.
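The two ideas the abstract names, request cohorts and transposed responses, can be illustrated with a minimal Python sketch. It is not the Rhythm implementation. The function names, the tiny cohort size, and the reading of "transposed" as a column-major layout (character i of every response stored together, so that adjacent data-parallel lanes touch adjacent memory) are assumptions made for illustration only.

```python
from collections import defaultdict

COHORT_SIZE = 4  # hypothetical value; chosen small so the example is easy to follow


def form_cohorts(requests, cohort_size=COHORT_SIZE):
    """Group (request_type, payload) pairs into same-type cohorts.

    Returns (full, leftovers). `full` is a list of (request_type, payloads)
    tuples, each with exactly `cohort_size` payloads, in the order the
    cohorts filled up. `leftovers` maps request_type to the payloads still
    waiting for enough similar requests to arrive.
    """
    pending = defaultdict(list)
    full = []
    for req_type, payload in requests:
        pending[req_type].append(payload)
        if len(pending[req_type]) == cohort_size:
            # The cohort is complete: hand it off as one data-parallel batch.
            full.append((req_type, pending.pop(req_type)))
    return full, dict(pending)


def transpose_responses(responses, pad=" "):
    """Lay out a cohort's responses column-major (assumed meaning of "transposed").

    Responses are padded to a common width; row i of the result holds
    character i of every response in the cohort.
    """
    if not responses:
        return []
    width = max(len(r) for r in responses)
    padded = [r.ljust(width, pad) for r in responses]
    return ["".join(r[i] for r in padded) for i in range(width)]
```

In this sketch a cohort is released only once it is full. A real server would presumably also need a timeout so that rare request types are not delayed indefinitely; that policy is left out here.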