Optimal sparse matrix dense vector multiplication in the I/O-model

Authors:
Michael A. Bender;Gerth Stølting Brodal;Rolf Fagerberg;Riko Jacob;Elias Vicari
Affiliations:
Stony Brook University, Stony Brook, NY;University of Aarhus, Aarhus, Denmark;University of Southern Denmark, Odense M, Denmark;ETH Zurich, Zurich, Switzerland;ETH Zurich, Zurich, Switzerland
Venue:
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
Year:
2007

Citing 12

The input/output complexity of sorting and related problems. Communications of the ACM.
Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems. SIAM Journal on Computing.
External memory algorithms and data structures. External memory algorithms.
On showing lower bounds for external-memory computational geometry problems. External memory algorithms.
A survey of out-of-core algorithms in numerical linear algebra. External memory algorithms.
PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Transactions on Mathematical Software (TOMS).
Cache-Oblivious Algorithms. FOCS '99 Proceedings of the 40th Annual Symposium on Foundations of Computer Science.
I/O complexity: The red-blue pebble game. STOC '81 Proceedings of the thirteenth annual ACM symposium on Theory of computing.
Optimizing the performance of sparse matrix-vector multiplication.
Multi-linear formulas for permanent and determinant are of super-polynomial size. STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing.
Automatic performance tuning of sparse matrix kernels.
Cache-aware and cache-oblivious adaptive sorting. ICALP'05 Proceedings of the 32nd international conference on Automata, Languages and Programming.

Cited 9

Provably good multicore cache performance for divide-and-conquer algorithms. Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms.
Communication-optimal parallel and sequential Cholesky decomposition: extended abstract. Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures.
Low depth cache-oblivious algorithms. Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures.
Evaluating non-square sparse bilinear forms on multiple vector pairs in the I/O-model. MFCS'10 Proceedings of the 35th international conference on Mathematical foundations of computer science.
Graph expansion and communication costs of fast matrix multiplication: regular submission. Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures.
The I/O complexity of sparse matrix dense matrix multiplication. LATIN'10 Proceedings of the 9th Latin American conference on Theoretical Informatics.
Managing data-movement for effective shared-memory parallelization of out-of-core sparse solvers. SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis.
Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. ACM SIGOPS 24th Symposium on Operating Systems Principles.
X-Stream: edge-centric graph processing using streaming partitions. Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.

Abstract

We analyze the problem of sparse-matrix dense-vector multiplication (SpMV) in the I/O-model. The task of SpMV is to compute $y := Ax$, where $A$ is a sparse $N \times N$ matrix and $x$ and $y$ are vectors. Here, sparsity is expressed by the parameter $k$ that states that $A$ has a total of at most $kN$ nonzeros, i.e., an average number of $k$ nonzeros per column. The extreme choices for parameter $k$ are well-studied special cases, namely for $k=1$ permuting and for $k=N$ dense matrix-vector multiplication. We study the worst-case complexity of this computational task, i.e., what is the best possible upper bound on the number of I/Os depending on $k$ and $N$ only. We determine this complexity up to a constant factor for large ranges of the parameters. By our arguments, we find that most matrices with $kN$ nonzeros require this number of I/Os, even if the program may depend on the structure of the matrix. The model of computation for the lower bound is a combination of the I/O-models of Aggarwal and Vitter, and of Hong and Kung. We study two variants of the problem, depending on the memory layout of $A$. If $A$ is stored in column major layout, SpMV has I/O complexity $\Theta\left(\min\left\{\frac{kN}{B}\left(1+\log_{M/B}\frac{N}{\max\{M,k\}}\right),\, kN\right\}\right)$ for $k \le N^{1-\varepsilon}$ and any constant $0 < \varepsilon < 1$. If the algorithm can choose the memory layout, the I/O complexity of SpMV is $\Theta\left(\min\left\{\frac{kN}{B}\left(1+\log_{M/B}\frac{N}{kM}\right),\, kN\right\}\right)$ for $k \le \sqrt[3]{N}$. In the cache oblivious setting with tall cache assumption $M \ge B^{1+\varepsilon}$, the I/O complexity is $O\left(\frac{kN}{B}\left(1+\log_{M/B}\frac{N}{k}\right)\right)$ for $A$ in column major layout.
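
To make the task and the column major bound concrete, the following is a minimal Python sketch and not the algorithm analyzed in the paper. It computes y := Ax by a direct scan over nonzeros stored in column major order, and evaluates the expression inside the Θ for the column major layout with the hidden constant ignored. The function names, the triple representation (row, col, value), and the choice to clamp a negative logarithm to zero are assumptions made for illustration only.

```python
import math


def spmv_column_major(n, entries, x):
    """Compute y = A x for an n x n matrix A.

    `entries` holds the nonzeros of A as (row, col, value) triples,
    listed column by column (column major order). This representation
    is an assumption of this sketch.
    """
    y = [0.0] * n
    for row, col, value in entries:
        # Each nonzero contributes value * x[col] to output row `row`.
        y[row] += value * x[col]
    return y


def column_major_bound(n, k, m, b):
    """Evaluate min{(kN/B) * (1 + log_{M/B}(N / max{M, k})), kN}.

    Constants hidden by the Theta-notation are ignored. A negative
    logarithm is clamped to zero, which is an assumption of this sketch.
    Requires m / b > 1 so that the logarithm base is valid.
    """
    log_term = math.log(n / max(m, k), m / b)
    sorting_like = (k * n / b) * (1 + max(0.0, log_term))
    return min(sorting_like, k * n)


if __name__ == "__main__":
    # A = [[2, 0, 0], [0, 0, 3], [1, 0, 0]] stored column by column.
    entries = [(0, 0, 2.0), (2, 0, 1.0), (1, 2, 3.0)]
    print(spmv_column_major(3, entries, [1.0, 2.0, 3.0]))
    print(column_major_bound(n=2**20, k=4, m=2**10, b=2**5))
```

The direct scan reads A sequentially but accesses x and y in whatever order the nonzeros dictate, so it does not by itself achieve the I/O bound. The second function only reports the size of the bound for given N, k, M, and B.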