Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Proceedings of the 1st International Conference on Supercomputing
Communication complexity of PRAMs
Theoretical Computer Science - Special issue: Fifteenth International Colloquium on Automata, Languages and Programming, Tampere, Finland, July 1988
Optimal broadcast and summation in the LogP model
SPAA '93: Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures
IBM Journal of Research and Development
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
LogP: a practical model of parallel computation
Communications of the ACM
Recursive array layouts and fast parallel matrix multiplication
Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures
Matrix Multiplication and Data Routing Using a Partitioned Optical Passive Stars Network
IEEE Transactions on Parallel and Distributed Systems
Optimal and efficient algorithms for summing and prefix summing on parallel machines
Journal of Parallel and Distributed Computing
Optimal Parallel Algorithms for Solving Tridiagonal Linear Systems
Euro-Par '97 Proceedings of the Third International Euro-Par Conference on Parallel Processing
Scalable Parallel Matrix Multiplication on Distributed Memory Parallel Computers
IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
Parallel Numerical Linear Algebra
A cellular computer to implement the Kalman filter algorithm
Parallelization of general matrix multiply routines using OpenMP
WOMPAT '04: Proceedings of the 5th International Conference on OpenMP Applications and Tools: Shared Memory Parallel Programming with OpenMP
Effective design of parallel matrix multiplication algorithms relies on many interdependent issues that depend on the underlying parallel machine or network on which the algorithms will run, as well as on the methodology an algorithm employs. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm, known or unknown, for this problem; moreover, any algorithm that claims to be optimal must attain this run-time. To obtain results that are general and useful across a span of machines, we base our analysis on the well-known LogP model. Three criteria determine the running time of a parallel algorithm: (i) the local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We establish optimality by first proving general lower bounds on parallel run-time; these bounds yield significant insights into (i)–(iii). In particular, we identify the data layouts and communication schedules needed to obtain optimal run-times. We prove that no single data layout can achieve optimal running times in all cases; instead, the optimal layout depends on the dimensions of each matrix and on the number of processors. Finally, we present optimal algorithms.
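To make the abstract's cost criteria concrete, the following is a minimal illustrative sketch, not taken from the paper, of how a run-time estimate for a distributed matrix multiply can be assembled under the LogP model. LogP's four parameters are L (latency), o (per-message overhead), g (gap between consecutive messages), and P (processor count); the schedule, message size, and data-volume assumptions below are hypothetical simplifications for illustration only.

```python
def message_time(L, o):
    # In LogP, one point-to-point message costs L + 2o
    # (sender overhead + network latency + receiver overhead).
    return L + 2 * o

def matmul_time_estimate(m, k, n, P, L, o, g, flop_time, words_per_msg):
    """Rough LogP-style estimate for C = A*B with A (m x k), B (k x n) on P processors.

    Assumes a hypothetical schedule: computation is divided evenly, and each
    processor receives its 1/P share of both operands in fixed-size messages.
    """
    # (i) Local computation: 2*m*k*n flops in total, split evenly across P.
    comp = 2 * m * k * n / P * flop_time
    # (ii) Data layout determines how many words each processor must fetch;
    # here we simply charge each processor a 1/P share of A and B.
    words = (m * k + k * n) / P
    msgs = -(-words // words_per_msg)  # ceiling division
    # (iii) Communication schedule: successive messages are separated by the
    # gap g; the final message also pays the full L + 2o pipeline cost.
    comm = (msgs - 1) * g + message_time(L, o)
    return comp + comm
```

This toy estimate already shows why the paper's three criteria interact: changing the layout changes `words`, and changing the schedule changes how the `g` and `L + 2o` terms accumulate, so no single layout minimizes the total for all matrix shapes and processor counts.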