This paper throws a small "wet blanket" on the hot topic of GPGPU acceleration, based on experience analyzing and tuning both multithreaded CPU and GPU implementations of three computations in scientific computing. These computations, namely (a) iterative sparse linear solvers, (b) sparse Cholesky factorization, and (c) the fast multipole method, exhibit complex behavior and vary in computational intensity and memory-reference irregularity. In each case, algorithmic analysis and prior work might lead us to conclude that an idealized GPU should deliver better performance; however, given at least equal-effort CPU tuning and consideration of realistic workloads and calling contexts, we find that two modern quad-core CPU sockets can roughly match one or two GPUs in performance. These conclusions are not intended to dampen interest in GPU acceleration; on the contrary, they partially illuminate the boundary between CPU and GPU performance, and they ask architects to consider application contexts in the design of future coupled on-die CPU/GPU processors.
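To make concrete the kind of kernel at issue, here is a minimal sketch (not from the paper) of sparse matrix-vector multiply in CSR format, the inner kernel of the iterative sparse solvers mentioned above. The irregular gather from `x` is what makes the computation memory-bandwidth-bound and hard to accelerate on either CPUs or GPUs; the function and data layout are standard CSR, with illustrative variable names chosen here.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  -- nonzero entries of A, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row_ptr[i]:row_ptr[i+1] delimits row i's nonzeros
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]  # irregular gather from x
        y[i] = s
    return y

# Example: the 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

Each nonzero contributes one multiply-add against two memory reads (the value and the gathered `x` entry), so arithmetic intensity is low regardless of the target processor; this is the behavior the paper's CPU-versus-GPU comparison probes.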