Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Authors:
Victor W. Lee; Changkyu Kim; Jatin Chhugani; Michael Deisher; Daehyun Kim; Anthony D. Nguyen; Nadathur Satish; Mikhail Smelyanskiy; Srinivas Chennupaty; Per Hammarlund; Ronak Singhal; Pradeep Dubey
Affiliations:
Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Hillsboro, OR, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Santa Clara, CA, USA; Intel Corporation, Hillsboro, OR, USA; Intel Corporation, Hillsboro, OR, USA; Intel Corporation, Hillsboro, OR, USA; Intel Corporation, Santa Clara, CA, USA
Venue:
Proceedings of the 37th annual international symposium on Computer architecture
Year:
2010

Abstract

Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that these kernels contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance differences come from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contribute to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.