Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Level 3 BLAS for distributed memory concurrent computers
Environments and tools for parallel scientific computing
IBM Journal of Research and Development
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
Using PLAPACK: parallel linear algebra package
Parallel Implementation of BLAS: General Techniques for Level 3 BLAS
Parallel Matrix Distributions: Have we been doing it all wrong?
A cellular computer to implement the Kalman filter algorithm
Scaling Simulation of the Fusing-Restricted Reconfigurable Mesh
IEEE Transactions on Parallel and Distributed Systems
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
A Family of High-Performance Matrix Multiplication Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Proceedings of the 6th workshop on Aspects, components, and patterns for infrastructure software
High performance dense linear algebra on a spatially distributed processor
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Combining building blocks for parallel multi-level matrix multiplication
Parallel Computing
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Parallel implementation of the Sherman-Morrison matrix inverse algorithm
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Communication optimal parallel multiplication of sparse random matrices
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
This paper explains why the parallel implementation of matrix multiplication, a seemingly simple algorithm that can be expressed as one statement and three nested loops, is complex: practical algorithms that use matrix multiplication tend to operate on matrices of disparate shapes, and the shape of the matrices can significantly affect the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. While the paper resolves a number of issues, it concludes with a discussion of a number of directions yet to be pursued.
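The "one statement and three nested loops" formulation the abstract refers to can be sketched as follows. This is a minimal illustration of the sequential kernel, not the paper's parallel implementation; row-major storage with leading dimensions equal to the column counts is an assumption of this sketch (BLAS itself takes explicit leading-dimension arguments).

```c
#include <assert.h>

/* Naive matrix multiplication: C := C + A * B, where A is m x k,
   B is k x n, and C is m x n, all stored row-major. The innermost
   statement is the "one statement"; the three loops over i, j, p
   are the "three nested loops". */
static void matmul(int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

The loop order shown (i, j, p) is only one of six permutations; the paper's point is that once the matrices are distributed across processors and have disparate shapes (e.g. rank-k updates or panel-panel products), the choice of algorithm, not just loop order, determines performance.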