Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Authors:
K. Fatahalian; J. Sugerman; P. Hanrahan
Affiliations:
Stanford University; Stanford University; Stanford University
Venue:
Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Year:
2004

Abstract

Using graphics hardware for general-purpose numerical computation has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of the input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find that even near-optimal GPU implementations are substantially less efficient than current cache-aware CPU approaches. We find that the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high-bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
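
The O(n) reuse mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' GPU implementation: the function names are hypothetical, and the second function uses a simplified textbook traffic model (square tiles, every tile of both inputs loaded from slow memory once per tile product, output traffic ignored) that is assumed here only to suggest why cache-aware blocking helps.

```python
def matmul_with_read_counts(a, b):
    """Naive n x n matrix product that also counts reads of each input element.

    Hypothetical helper for illustration only; plain Python lists, no GPU.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    reads_a = [[0] * n for _ in range(n)]
    reads_b = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                reads_a[i][k] += 1  # a[i][k] is touched once per column j
                reads_b[k][j] += 1  # b[k][j] is touched once per row i
            c[i][j] = acc
    return c, reads_a, reads_b


def slow_memory_loads(n, block):
    """Estimated input-element loads from slow memory for a blocked product.

    Assumed model: (n / block)^3 tile products, each loading one
    block x block tile of A and one of B; block must divide n.
    """
    if block <= 0 or n % block != 0:
        raise ValueError("block must be a positive divisor of n")
    tiles = n // block
    return tiles ** 3 * 2 * block * block
```

Under these assumptions every input element is read n times by the naive loop, and the estimated slow-memory traffic falls from 2n^3 loads with a block size of 1 to 2n^3/b with block size b. That reduction is only realized if the tiles can be served quickly from a nearby cache, which is the resource the abstract identifies as limiting on GPUs.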