Existing formats for Sparse Matrix-Vector Multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation that performs SpMV on the GPU using atomic operations. We compare the performance of SCOO to that of existing formats in the NVIDIA Cusp library on large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices, and the HYB format for all tested unstructured matrices, on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Because existing CUDA-enabled GPUs perform atomic operations on double-precision floating-point numbers more slowly, the double-precision SCOO implementation does not consistently outperform the other formats for every unstructured matrix. Overall, on a Tesla C2075 the average speedup of SCOO over the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision. Furthermore, a comparison with a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. Source code is available at https://github.com/danghvu/cudaSpmv.