Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform

Authors:
Shiming Xu;Wei Xue;Hai Xiang Lin
Affiliations:
, Delft, The Netherlands 2628 CD;Tsinghua University, Beijing, China 100084;, Delft, The Netherlands 2628 CD
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPUs using CUDA. SpMV has a very low computation-to-data ratio, and its performance is mainly bound by memory bandwidth. We propose optimizations of SpMV based on the ELLPACK format from two aspects: (1) improving access performance for the dense vector by reducing cache misses, and (2) reducing the amount of matrix data accessed through index reduction. Matrix bandwidth reduction techniques enable both the cache usage enhancement and the index compression. For GPUs with better cache support, we propose a differentiated memory access scheme to avoid contamination of the caches by matrix data. Performance evaluation shows that the combined speedups of the proposed optimizations are 16% (single-precision) and 12.6% (double-precision) on the GT-200 GPU, and 19% (single-precision) and 15% (double-precision) on the GF-100 GPU.
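
To make the ELLPACK and index-reduction ideas concrete, the following is a minimal CPU reference sketch in Python, not the CUDA kernels of the article. The function names (`to_ellpack`, `compress_indices`, `spmv_ell`) are hypothetical, the matrix is assumed to be square, and the row-relative offset encoding is only one plausible form of index reduction; the abstract does not specify the exact scheme used.

```python
# CPU reference sketch of ELLPACK SpMV with row-relative column offsets.
# All names here are illustrative assumptions, not taken from the article.

def to_ellpack(dense):
    """Convert a square dense matrix (list of lists) to ELLPACK arrays.

    Every row is padded to the width of the longest row. Padding slots hold
    the value 0.0 and reuse the row's own index as the column index, so they
    stay in range for a square matrix and contribute nothing to the product.
    """
    n_rows = len(dense)
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max((len(r) for r in rows), default=0)
    values = [[0.0] * width for _ in range(n_rows)]
    cols = [[i] * width for i in range(n_rows)]
    for i, r in enumerate(rows):
        for k, (j, v) in enumerate(r):
            values[i][k] = float(v)
            cols[i][k] = j
    return values, cols


def compress_indices(cols):
    """Store each column index as an offset from its row index.

    For a matrix with bandwidth b, every offset lies in [-b, b], so it could
    be held in a narrower integer type than a full 32-bit column index.
    """
    return [[j - i for j in row] for i, row in enumerate(cols)]


def spmv_ell(values, offsets, x):
    """Compute y = A x from ELLPACK values and row-relative offsets."""
    y = []
    for i, (vals, offs) in enumerate(zip(values, offsets)):
        acc = 0.0
        for v, d in zip(vals, offs):
            acc += v * x[i + d]  # recover the column index as row + offset
        y.append(acc)
    return y
```

For a tridiagonal matrix, for instance, every offset produced by `compress_indices` is -1, 0, or 1, which illustrates why reducing the matrix bandwidth first makes the indices cheaper to store and keeps accesses to the dense vector `x` close together. On a GPU the ELLPACK arrays are typically laid out column-major so that neighboring threads read neighboring memory; the row-major lists above are kept only for readability.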