Internet Streaming SIMD Extensions
Computer
Iterative Methods for Sparse Linear Systems
Automatic performance tuning of sparse matrix kernels
Sparsity: Optimization Framework for Sparse Matrix Kernels
International Journal of High Performance Computing Applications
Optimization of sparse matrix-vector multiplication on emerging multicore platforms
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Implementing sparse matrix-vector multiplication on throughput-oriented processors
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Improving the Performance of the Sparse Matrix Vector Product with GPUs
CIT '10 Proceedings of the 2010 10th IEEE International Conference on Computer and Information Technology
Auto-Tuning CUDA Parameters for Sparse Matrix-Vector Multiplication on GPUs
ICCIS '10 Proceedings of the 2010 International Conference on Computational and Information Sciences
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
The University of Florida sparse matrix collection
ACM Transactions on Mathematical Software (TOMS)
OpenCL Programming Guide
Reduced-Bandwidth Multithreaded Algorithms for Sparse Matrix-Vector Multiplication
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
Automatically tuning sparse matrix-vector multiplication for GPU architectures
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Accelerating sparse matrix-vector multiplication on GPUs using bit-representation-optimized schemes
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
yaSpMV: yet another SpMV framework on GPUs
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
The sparse matrix-vector multiplication (SpMV) kernel is a key computation in linear algebra: most iterative methods consist of SpMV operations combined with BLAS1 updates. Researchers have therefore made extensive efforts to optimize the SpMV kernel in sparse linear algebra. With the advent of OpenCL, a programming standard that unifies parallel programming across a wide variety of heterogeneous platforms, the SpMV kernel can be optimized on many different platforms. In this paper, we propose a new sparse matrix format, the Cocktail Format, which combines the strengths of many different sparse matrix formats. Based on the Cocktail Format, we develop the clSpMV framework, which analyzes arbitrary sparse matrices at runtime and recommends the best representation of a given sparse matrix on each platform. Although solutions that are portable across diverse platforms generally deliver lower performance than solutions specialized to particular platforms, our experimental results show that clSpMV can find the best representations of the input sparse matrices on both Nvidia and AMD platforms, delivering 83% higher performance than the vendor-optimized CUDA implementation of the hybrid sparse format proposed in [3], and 63.6% higher performance than the CUDA implementations of all sparse formats in [3].