Model-driven autotuning of sparse matrix-vector multiply on GPUs

Authors:
Jee W. Choi; Amik Singh; Richard W. Vuduc
Affiliations:
Georgia Institute of Technology, Atlanta, GA, USA; Indian Institute of Technology Roorkee, Roorkee, India; Georgia Institute of Technology, Atlanta, GA, USA
Venue:
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Year:
2010

References (14)

Solving elliptic problems using ELLPACK.
Improving performance of sparse matrix-vector multiplication. SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing.
Towards a fast parallel sparse symmetric matrix-vector multiplication. Parallel Computing - Linear systems and associated problems.
Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM SIGGRAPH 2003 Papers.
Automatic performance tuning of sparse matrix kernels.
Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications.
Scan primitives for GPU computing. Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware.
Sparse matrix computations on manycore GPU's. Proceedings of the 45th annual Design Automation Conference.
Benchmarking GPUs to tune dense linear algebra. Proceedings of the 2008 ACM/IEEE conference on Supercomputing.
Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing.
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. Proceedings of the 36th annual international symposium on Computer architecture.
Parallel finite element analysis platform for the earth simulator: GeoFEM. ICCS'03 Proceedings of the 2003 international conference on Computational science: Part III.
Fast sparse matrix-vector multiplication by exploiting variable block structure. HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications.
Vectorized sparse matrix multiply for compressed row storage format. ICCS'05 Proceedings of the 5th international conference on Computational Science - Volume Part I.

Cited by (46)

Best-effort semantic document search on GPUs. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units.
On the limits of GPU acceleration. HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism.
Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
Fast sparse matrix-vector multiplication on GPUs: implications for graph mining. Proceedings of the VLDB Endowment.
Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation. Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units.
An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs. Proceedings of the international conference on Supercomputing.
MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science - Research and Development.
Balance principles for algorithm-architecture co-design. HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism.
A model-driven partitioning and auto-tuning integrated framework for sparse matrix-vector multiplication on GPUs. Proceedings of the 2011 TeraGrid Conference: Extreme Digital Discovery.
A fully empirical autotuned dense QR factorization for multicore architectures. Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II.
Model-driven tile size selection for DOACROSS loops on GPUs. Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II.
Iterative sparse Matrix-Vector multiplication for integer factorization on GPUs. Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II.
GROPHECY: GPU performance projection from CPU code skeletons. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
GPU accelerated CAE using open solvers and the cloud. ACM SIGARCH Computer Architecture News.
A performance analysis framework for identifying potential benefits in GPGPU applications. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming.
GPU-based NFA implementation for memory efficient high speed regular expression matching. Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming.
Optimization of sparse matrix-vector multiplication using reordering techniques on GPUs. Microprocessors & Microsystems.
High-performance sparse matrix-vector multiplication on GPUs for structured grid computations. Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units.
Model-driven adaptation of double-precision matrix multiplication to the Cell processor architecture. Parallel Computing.
Parameterized micro-benchmarking: an auto-tuning approach for complex applications. Proceedings of the 9th conference on Computing Frontiers.
Parallelizing SOR for GPGPUs using alternate loop tiling. Parallel Computing.
Design patterns for scientific computations on sparse matrices. Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing.
Automatic restructuring of GPU kernels for exploiting inter-thread data locality. CC'12 Proceedings of the 21st international conference on Compiler Construction.
clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs. Proceedings of the 26th ACM international conference on Supercomputing.
Automatic tuning of the sparse matrix vector product on GPUs based on the ELLR-T approach. Parallel Computing.
GPU acceleration of the matrix-free interior point method. PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I.
Dataflow-driven GPU performance projection for multi-kernel transformations. SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs. Parallel Computing.
GPU-accelerated preconditioned iterative linear solvers. The Journal of Supercomputing.
Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform. The Journal of Supercomputing.
Portable performance on heterogeneous architectures. Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems.
Optimizing tensor contraction expressions for hybrid CPU-GPU execution. Cluster Computing.
Influence of memory access patterns to small-scale FFT performance. The Journal of Supercomputing.
The BiConjugate gradient method on GPUs. The Journal of Supercomputing.
Input-aware auto-tuning for directive-based GPU programming. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units.
Adaptive thread distributions for SpMV on a GPU. Proceedings of the Extreme Scaling Workshop.
SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication. Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation.
Efficient sparse matrix-vector multiplication on x86-based many-core processors. Proceedings of the 27th international ACM conference on International conference on supercomputing.
Scaling large-data computations on multi-GPU accelerators. Proceedings of the 27th international ACM conference on International conference on supercomputing.
Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes. SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
Starchart: hardware and software optimization using recursive partitioning regression trees. PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques.
On the GPU performance of cell-centered finite volume method over unstructured tetrahedral meshes. IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms.
yaSpMV: yet another SpMV framework on GPUs. Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming.
A memory access model for highly-threaded many-core architectures. Future Generation Computer Systems.
CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations. Parallel Computing.
Design patterns for sparse-matrix computations on hybrid CPU/GPU platforms. Scientific Programming.

Abstract

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but it more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations whose performance is within 15% of the best found through exhaustive search.
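
To make the storage formats named in the abstract more concrete, the following is a minimal sketch of SpMV over the plain (unblocked) ELLPACK layout, written in CPU-side Python for readability. It is not the authors' BELLPACK implementation and contains no GPU code: the function names `dense_to_ell` and `ell_spmv`, the choice of column index 0 for padding slots, and the row-major list layout are all assumptions made for illustration. The point it shows is that ELLPACK pads every row to the same number of stored entries, so each row performs an identical amount of work.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_ell(a: Matrix) -> Tuple[List[List[float]], List[List[int]]]:
    """Pack a dense matrix into ELLPACK arrays (values, column indices).

    Every row is padded to the length of the longest row. Padding slots
    hold value 0.0 and column index 0 (an illustrative choice), so they
    add nothing to the result.
    """
    rows = [[(j, v) for j, v in enumerate(row) if v != 0.0] for row in a]
    width = max((len(r) for r in rows), default=0)
    vals = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return vals, cols


def ell_spmv(vals: List[List[float]], cols: List[List[int]],
             x: List[float]) -> List[float]:
    """Compute y = A x from ELLPACK arrays.

    The outer loop over rows is the part a GPU kernel would spread across
    threads; the inner loop has the same trip count for every row.
    """
    y = []
    for row_vals, row_cols in zip(vals, cols):
        acc = 0.0
        for v, j in zip(row_vals, row_cols):
            acc += v * x[j]
        y.append(acc)
    return y


# Small worked example: a 3x4 matrix with at most two nonzeros per row.
A = [
    [4.0, 0.0, 0.0, 1.0],
    [0.0, 3.0, 0.0, 0.0],
    [2.0, 0.0, 5.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0]
vals, cols = dense_to_ell(A)
y = ell_spmv(vals, cols, x)
print(y)
```

The padding is also where the cost lies: a matrix with one long row forces every other row to store and multiply zeros. That trade-off is one reason the best format and its parameters depend on the input matrix, as the abstract notes.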