Implementing sparse matrix-vector multiplication on throughput-oriented processors

Authors:
Nathan Bell;Michael Garland
Affiliations:
NVIDIA Research;NVIDIA Research
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Abstract

Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra. In contrast to the uniform regularity of dense linear algebra, sparse operations encounter a broad spectrum of matrices ranging from the regular to the highly irregular. Harnessing the tremendous potential of throughput-oriented processors for sparse operations requires that we expose substantial fine-grained parallelism and impose sufficient regularity on execution paths and memory access patterns. We explore SpMV methods that are well-suited to throughput-oriented architectures like the GPU and which exploit several common sparsity classes. The techniques we propose are efficient, successfully utilizing large percentages of peak bandwidth. Furthermore, they deliver excellent total throughput, averaging 16 GFLOP/s and 10 GFLOP/s in double precision for structured grid and unstructured mesh matrices, respectively, on a GeForce GTX 285. This is roughly 2.8 times the throughput previously achieved on Cell BE and more than 10 times that of a quad-core Intel Clovertown system.
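
For readers unfamiliar with the operation, the following is a minimal serial sketch of SpMV, y = A x, using the widely used compressed sparse row (CSR) layout. It is an illustration only, not the kernels evaluated in the paper: the function name `spmv_csr`, the array names `row_ptr`, `col_idx`, and `values`, and the small 4x4 matrix are assumptions chosen for this example. Each row's dot product is independent of the others, which is the kind of fine-grained parallelism the abstract refers to, while the differing row lengths hint at the irregularity that sparse formats must manage.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] is the slice of col_idx/values holding row i.
    All names here are illustrative, not taken from the paper.
    """
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):          # rows are independent of each other
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]   # gather from x by column index
        y[row] = acc
    return y


# Hypothetical 4x4 example matrix:
#   [1 7 0 0]
#   [0 2 8 0]
#   [5 0 3 9]
#   [0 6 0 4]
row_ptr = [0, 2, 4, 7, 9]
col_idx = [0, 1, 1, 2, 0, 2, 3, 1, 3]
values = [1.0, 7.0, 2.0, 8.0, 5.0, 3.0, 9.0, 6.0, 4.0]
x = [1.0, 2.0, 3.0, 4.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)  # [15.0, 28.0, 50.0, 28.0]
```

On a throughput-oriented processor, one plausible mapping assigns each iteration of the outer loop to its own thread or group of threads. How well that performs depends on how evenly the nonzeros are distributed across rows, which is one reason different sparsity classes can favor different storage formats.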