Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Level 3 BLAS for distributed memory concurrent computers
Environments and tools for parallel scientific computing
IBM Journal of Research and Development
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
Using PLAPACK: parallel linear algebra package
Parallel Implementation of BLAS: General Techniques for Level 3 BLAS
Parallel Matrix Distributions: Have we been doing it all wrong?
A cellular computer to implement the Kalman filter algorithm
Scaling Simulation of the Fusing-Restricted Reconfigurable Mesh
IEEE Transactions on Parallel and Distributed Systems
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
A Family of High-Performance Matrix Multiplication Algorithms
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
Parallel Out-of-Core Cholesky and QR Factorization with POOCLAPACK
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Proceedings of the 6th workshop on Aspects, components, and patterns for infrastructure software
High performance dense linear algebra on a spatially distributed processor
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Combining building blocks for parallel multi-level matrix multiplication
Parallel Computing
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
Toward scalable matrix multiply on multithreaded architectures
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Parallel implementation of the Sherman-Morrison matrix inverse algorithm
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing
Communication optimal parallel multiplication of sparse random matrices
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
This paper explains why the parallel implementation of matrix multiplication, a seemingly simple algorithm that can be expressed as one statement and three nested loops, is complex: practical algorithms that use matrix multiplication tend to operate on matrices of disparate shapes, and the shape of the matrices can significantly affect the performance of matrix multiplication. We provide a class of algorithms that covers the spectrum of shapes encountered and demonstrate that good performance can be attained if the right algorithm is chosen. While the paper resolves a number of issues, it concludes with a discussion of a number of directions yet to be pursued.
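The "one statement and three nested loops" formulation the abstract refers to can be sketched as follows. This is a minimal illustration of the sequential kernel, not the paper's parallel implementation; row-major storage with leading dimensions equal to the column counts is an assumption of this sketch (BLAS itself takes explicit leading-dimension arguments).

```c
#include <assert.h>

/* Naive matrix multiplication: C := C + A * B, where A is m x k,
   B is k x n, and C is m x n, all stored row-major. The innermost
   statement is the "one statement"; the three loops over i, j, p
   are the "three nested loops". */
static void matmul(int m, int n, int k,
                   const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int p = 0; p < k; p++)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}
```

The loop order shown (i, j, p) is only one of six permutations; the paper's point is that once the matrices are distributed across processors and have disparate shapes (e.g. rank-k updates or panel-panel products), the choice of algorithm, not just loop order, determines performance.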