Benchmarking GPUs to tune dense linear algebra

Authors:
Vasily Volkov; James W. Demmel
Affiliations:
University of California at Berkeley; University of California at Berkeley
Venue:
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Year:
2008

Abstract

We present performance results for dense linear algebra using recent NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs up to 60% faster than the vendor's implementation and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. Our parallel LU running on two GPUs achieves up to ~540 Gflop/s. These results are accomplished by challenging the accepted view of the GPU architecture and programming guidelines. We argue that modern GPUs should be viewed as multithreaded multicore vector units. We exploit blocking, as on vector computers, and the heterogeneity of the system by computing on both the GPU and the CPU. This study includes detailed benchmarking of the GPU memory system that reveals the sizes and latencies of the caches and TLB. We also present two algorithmic optimizations that increase parallelism and regularity in the problem and yield slightly higher performance.
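
The rates quoted above (Gflop/s, and a factorization's share of the GEMM rate) follow from standard operation counts. The sketch below shows that arithmetic in Python. It assumes the conventional leading-order counts of 2mnk flops for GEMM and (2/3)n^3 flops for LU; the function names and the timings in the usage lines are hypothetical and are not taken from the paper.

```python
def gemm_flops(m, n, k):
    """Conventional flop count for C = A*B with A (m x k) and B (k x n)."""
    return 2.0 * m * n * k


def lu_flops(n):
    """Leading-order flop count for LU factorization of an n x n matrix."""
    return (2.0 / 3.0) * n ** 3


def gflops(flops, seconds):
    """Convert a flop count and a wall-clock time into a rate in Gflop/s."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flops / seconds / 1e9


def fraction_of_gemm_rate(factorization_rate, gemm_rate):
    """Share of the GEMM rate that a factorization achieves (0.8 means 80%)."""
    if gemm_rate <= 0:
        raise ValueError("gemm_rate must be positive")
    return factorization_rate / gemm_rate


if __name__ == "__main__":
    # Hypothetical timings, chosen only to illustrate the arithmetic.
    gemm_rate = gflops(gemm_flops(4096, 4096, 4096), seconds=0.5)
    lu_rate = gflops(lu_flops(12000), seconds=5.0)
    print(f"GEMM: {gemm_rate:.1f} Gflop/s")
    print(f"LU:   {lu_rate:.1f} Gflop/s")
    print(f"LU / GEMM: {fraction_of_gemm_rate(lu_rate, gemm_rate):.2f}")
```

Reporting a factorization as a fraction of the GEMM rate, rather than of the theoretical hardware peak, measures it against the fastest kernel it is built on.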