This paper introduces a storage format for sparse matrices, called compressed sparse blocks (CSB), which allows both Ax and Aᵀx to be computed efficiently in parallel, where A is an n×n sparse matrix with nnz nonzeros and x is a dense n-vector. Our algorithms use Θ(nnz) work (serial running time) and Θ(√n lg n) span (critical-path length), yielding a parallelism of Θ(nnz/(√n lg n)), which is amply high for virtually any large matrix. The storage requirement for CSB is the same as that for the more standard compressed-sparse-rows (CSR) format, in which computing Ax in parallel is easy but Aᵀx is difficult. Benchmark results indicate that on one processor, the CSB algorithms for Ax and Aᵀx run just as fast as the CSR algorithm for Ax, but the CSB algorithms also scale up linearly with the number of processors until limited by off-chip memory bandwidth.
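The key idea the abstract describes is that one blocked layout serves both products: nonzeros are grouped into β×β blocks and stored with block-local coordinates, so Ax and Aᵀx traverse exactly the same data. The real CSB implementation additionally orders nonzeros within a block in Z-Morton order and parallelizes recursively with Cilk++; the serial Python sketch below (the class and method names `CSB` and `spmv` are my own, not from the paper) only illustrates the storage symmetry.

```python
class CSB:
    """Toy compressed-sparse-blocks store.

    The n x n matrix is partitioned into beta x beta blocks; each nonzero
    is kept as a (local_row, local_col, value) triple inside its block.
    Because blocks are square, A*x and A^T*x read the identical layout,
    only swapping the roles of row and column indices.
    """

    def __init__(self, n, triples, beta):
        self.n, self.beta = n, beta
        # (block_row, block_col) -> list of (local_row, local_col, value)
        self.blocks = {}
        for i, j, v in triples:
            key = (i // beta, j // beta)
            self.blocks.setdefault(key, []).append((i % beta, j % beta, v))

    def spmv(self, x, transpose=False):
        """Serial y = A*x, or y = A^T*x when transpose=True."""
        y = [0.0] * self.n
        b = self.beta
        for (br, bc), trips in self.blocks.items():
            for r, c, v in trips:
                if transpose:
                    # Block (br, bc) of A is block (bc, br) of A^T,
                    # with local indices swapped.
                    y[bc * b + c] += v * x[br * b + r]
                else:
                    y[br * b + r] += v * x[bc * b + c]
        return y
```

For example, for A = [[1, 2], [0, 3]] and x = (1, 1), `spmv` returns Ax = (3, 3) and Aᵀx = (1, 5) from the same block structure, which is the property that makes CSB attractive when both products are needed (as in many iterative solvers).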