We present lower bounds on the amount of communication that matrix multiplication algorithms must perform on a distributed-memory parallel computer. We denote the number of processors by P and the dimension of the square matrices by n. We show that the most widely used class of algorithms, the so-called two-dimensional (2D) algorithms, is optimal, in the sense that in any algorithm that uses only O(n^2/P) words of memory per processor, at least one processor must send or receive Ω(n^2/P^{1/2}) words. We also show that algorithms from another class, the so-called three-dimensional (3D) algorithms, are optimal as well. These algorithms use replication to reduce communication. We show that in any algorithm that uses O(n^2/P^{2/3}) words of memory per processor, at least one processor must send or receive Ω(n^2/P^{2/3}) words. Furthermore, we show a continuous tradeoff between the size of the local memories and the amount of communication that must be performed; the 2D and 3D bounds are essentially instantiations of this tradeoff. We also show that if the input is distributed across the local memories of multiple nodes without replication, then Ω(n^2) words must cross any bisection cut of the machine. All our bounds apply only to conventional Θ(n^3) algorithms; they do not apply to Strassen's algorithm or other o(n^3) algorithms.
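To illustrate the memory-communication tradeoff the abstract describes, the sketch below evaluates a bound of the asymptotic form Ω(n^3 / (P · √M)), where M is the local memory per processor; the exact constant factors are omitted here and this specific functional form is an assumption consistent with the abstract, not a quotation of the paper's lemma. Plugging in M = n^2/P recovers the 2D bound Ω(n^2/P^{1/2}), and M = n^2/P^{2/3} recovers the 3D bound Ω(n^2/P^{2/3}), showing both as instantiations of the same tradeoff.

```python
import math

def comm_lower_bound(n, p, mem):
    """Asymptotic communication lower bound per processor (constants dropped):
    with mem words of local memory, some processor must send or receive on
    the order of n**3 / (p * sqrt(mem)) words. Illustrative sketch only."""
    return n**3 / (p * math.sqrt(mem))

n, p = 4096, 64

# 2D algorithms: M = n^2/P local memory -> bound simplifies to n^2 / sqrt(P)
two_d = comm_lower_bound(n, p, n**2 / p)

# 3D algorithms: M = n^2/P^(2/3) local memory -> bound simplifies to n^2 / P^(2/3)
three_d = comm_lower_bound(n, p, n**2 / p**(2 / 3))

print(f"2D bound ~ {two_d:.0f} words  (n^2/sqrt(P) = {n**2 / math.sqrt(p):.0f})")
print(f"3D bound ~ {three_d:.0f} words (n^2/P^(2/3) = {n**2 / p**(2 / 3):.0f})")
```

Note how replication (larger M in the 3D case) lowers the bound: the tradeoff is continuous in M, and the two classical algorithm classes simply sit at two particular points on that curve.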