A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers

Authors:
J. Choi
Affiliations:
-
Venue:
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Year:
1997

Citing 4

A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)

A high-performance matrix-multiplication algorithm on a distributed-memory parallel computer, using overlapped communication
IBM Journal of Research and Development

LAPACK Working Note 96: Scalable Universal Matrix Multiplication Algorithm

A cellular computer to implement the Kalman filter algorithm

Cited 3

64-bit floating-point FPGA matrix multiplication
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays

Memory efficient parallel matrix multiplication operation for irregular problems
Proceedings of the 3rd conference on Computing frontiers

Parallelization of divide-and-conquer eigenvector accumulation
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing

Abstract

The author presents DIMMA (distribution-independent matrix multiplication algorithm), a fast and scalable matrix multiplication algorithm for distributed-memory concurrent computers whose performance is independent of how the data are distributed across processors. The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine on each processor, whether the block size is too small or too large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.
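
The abstract does not spell out the LCM block concept, so the following is only a minimal sketch of the arithmetic that the name suggests, not the paper's implementation. It assumes a block-cyclic layout on a P x Q process grid in which column block k of A is held by process column k mod Q and row block k of B by process row k mod P. Under that assumption the owner pair repeats with period lcm(P, Q), so blocks whose indices differ by a multiple of the LCM share the same owners and could in principle be combined into one larger local multiplication. The function names `owner_pair` and `group_by_owner` are hypothetical.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def owner_pair(k, p, q):
    """Owners of block index k under the assumed block-cyclic layout.

    Assumption (not taken from the abstract): row block k of B sits on
    process row k % p, and column block k of A on process column k % q.
    """
    return (k % p, k % q)


def group_by_owner(num_blocks, p, q):
    """Group block indices that share the same (process row, process column)."""
    groups = {}
    for k in range(num_blocks):
        groups.setdefault(owner_pair(k, p, q), []).append(k)
    return groups


if __name__ == "__main__":
    p, q = 2, 3  # hypothetical 2 x 3 process grid
    print("LCM =", lcm(p, q))
    for owners, blocks in group_by_owner(12, p, q).items():
        # Indices within one group differ by multiples of lcm(p, q).
        print(owners, blocks)
```

On the hypothetical 2 x 3 grid, blocks 0 and 6 fall in the same group, as do 1 and 7, and so on. Grouping blocks this way is one plausible reading of how a small distribution block size could still yield a larger operand for the sequential BLAS call.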