Several fast sequential algorithms have been proposed for multiplying sparse matrices, but these algorithms do not explicitly address the impact of caching on performance. We show that a rather simple, cache-efficient sequential algorithm provides significantly better performance than existing algorithms for sparse matrix multiplication. We then describe a multithreaded implementation of this simple algorithm and show that its performance scales well with the number of threads and CPUs. For 10% sparse, 500 × 500 matrices, the multithreaded version running on a 4-CPU system is more than 41.1 times faster than the well-known BLAS routine, and 14.6 and 44.6 times faster than two other recent techniques for fast sparse matrix multiplication, both of which are relatively difficult to parallelize efficiently.
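The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one classic cache-friendly scheme for sparse matrix multiplication: a row-by-row CSR product in the style of Gustavson, where each output row is built in a dense accumulator so that both operands are traversed sequentially in memory. The function name, argument layout, and CSR representation are assumptions for the example, not the paper's actual implementation.

```python
def spgemm(n, a_ptr, a_col, a_val, b_ptr, b_col, b_val):
    """Compute C = A @ B for n x n matrices stored in CSR form
    (row pointers, column indices, values).

    The product is formed row by row: for each nonzero A[i][k],
    the scaled row k of B is added into a dense accumulator for
    row i of C.  Both A and B are thus scanned in storage order,
    which is what makes this access pattern cache-friendly."""
    c_ptr, c_col, c_val = [0], [], []
    acc = [0.0] * n    # dense accumulator for the current output row
    seen = [-1] * n    # last output row in which column j was touched
    for i in range(n):
        touched = []
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a = a_col[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_col[q]
                if seen[j] != i:        # first contribution to C[i][j]
                    seen[j] = i
                    acc[j] = 0.0
                    touched.append(j)
                acc[j] += a * b_val[q]
        for j in sorted(touched):       # pack row i in column order
            c_col.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val
```

A multithreaded version of this scheme parallelizes naturally, since distinct output rows can be computed independently by different threads, each with its own accumulator.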