Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Proceedings of the 1st International Conference on Supercomputing
Communication complexity of PRAMs
Theoretical Computer Science - Special issue: Fifteenth International Colloquium on Automata, Languages and Programming, Tampere, Finland, July 1988
Optimal broadcast and summation in the LogP model
SPAA '93: Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures
IBM Journal of Research and Development
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
LogP: a practical model of parallel computation
Communications of the ACM
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
Matrix Multiplication and Data Routing Using a Partitioned Optical Passive Stars Network
IEEE Transactions on Parallel and Distributed Systems
Optimal and efficient algorithms for summing and prefix summing on parallel machines
Journal of Parallel and Distributed Computing
Optimal Parallel Algorithms for Solving Tridiagonal Linear Systems
Euro-Par '97 Proceedings of the Third International Euro-Par Conference on Parallel Processing
Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Parallel Numerical Linear Algebra
A cellular computer to implement the Kalman filter algorithm
Parallelization of general matrix multiply routines using OpenMP
WOMPAT '04: Proceedings of the 5th International Conference on OpenMP Applications and Tools: Shared Memory Parallel Programming with OpenMP
Effective design of parallel matrix multiplication algorithms relies on many interdependent issues that depend on the underlying parallel machine or network on which the algorithms will run, as well as on the methodology an algorithm employs. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm, known or unknown, for this problem; moreover, any algorithm that claims to be optimal must attain this run-time. To obtain results that are general and useful across a span of machines, we base our analysis on the well-known LogP model. Three criteria determine the running time of a parallel algorithm: (i) the local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We establish optimality by first proving general lower bounds on parallel run-time; these bounds yield significant insights into (i)–(iii). In particular, we identify the data layouts and communication schedules needed to obtain optimal run-times. We prove that no single data layout can achieve optimal running times in all cases; instead, the optimal layout depends on the dimensions of each matrix and on the number of processors. Finally, we present optimal algorithms.
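To make the abstract's cost criteria concrete, the following is a minimal illustrative sketch, not taken from the paper, of how a run-time estimate for a distributed matrix multiply can be assembled under the LogP model. LogP's four parameters are L (latency), o (per-message overhead), g (gap between consecutive messages), and P (processor count); the schedule, message size, and data-volume assumptions below are hypothetical simplifications for illustration only.

```python
def message_time(L, o):
    # In LogP, one point-to-point message costs L + 2o
    # (sender overhead + network latency + receiver overhead).
    return L + 2 * o

def matmul_time_estimate(m, k, n, P, L, o, g, flop_time, words_per_msg):
    """Rough LogP-style estimate for C = A*B with A (m x k), B (k x n) on P processors.

    Assumes a hypothetical schedule: computation is divided evenly, and each
    processor receives its 1/P share of both operands in fixed-size messages.
    """
    # (i) Local computation: 2*m*k*n flops in total, split evenly across P.
    comp = 2 * m * k * n / P * flop_time
    # (ii) Data layout determines how many words each processor must fetch;
    # here we simply charge each processor a 1/P share of A and B.
    words = (m * k + k * n) / P
    msgs = -(-words // words_per_msg)  # ceiling division
    # (iii) Communication schedule: successive messages are separated by the
    # gap g; the final message also pays the full L + 2o pipeline cost.
    comm = (msgs - 1) * g + message_time(L, o)
    return comp + comm
```

This toy estimate already shows why the paper's three criteria interact: changing the layout changes `words`, and changing the schedule changes how the `g` and `L + 2o` terms accumulate, so no single layout minimizes the total for all matrix shapes and processor counts.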