Matrix multiplication via arithmetic progressions
STOC '87 Proceedings of the nineteenth annual ACM symposium on Theory of computing
Communication complexity of PRAMs
Theoretical Computer Science - Special issue: Fifteenth international colloquium on automata, languages and programming, Tampere, Finland, July 1988
A bridging model for parallel computation
Communications of the ACM
On the additive complexity of 2 × 2 matrix multiplication
Information Processing Letters
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
ScaLAPACK user's guide
A scalable parallel Strassen's matrix multiplication algorithm for distributed-memory computers
SAC '95 Proceedings of the 1995 ACM symposium on Applied computing
Accuracy and Stability of Numerical Algorithms
A cellular computer to implement the Kalman filter algorithm
Communication lower bounds for distributed-memory matrix multiplication
Journal of Parallel and Distributed Computing
Concurrency and Computation: Practice & Experience
Group-theoretic Algorithms for Matrix Multiplication
FOCS '05 Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science
Fast matrix multiplication is stable
Numerische Mathematik
Combining building blocks for parallel multi-level matrix multiplication
Parallel Computing
Graph expansion and communication costs of fast matrix multiplication: regular submission
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
The Future of Computing Performance: Game Over or Next Level?
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
A tensor product formulation of Strassen's matrix multiplication algorithm with memory reduction
IPPS '93 Proceedings of the 1993 Seventh International Parallel Processing Symposium
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Communication-avoiding parallel Strassen: implementation and performance
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Graph expansion and communication costs of fast matrix multiplication
Journal of the ACM (JACM)
Graph expansion analysis for communication costs of fast rectangular matrix multiplication
MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
Work-efficient matrix inversion in polylogarithmic time
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Communication optimal parallel multiplication of sparse random matrices
Proceedings of the twenty-fifth annual ACM symposium on Parallelism in algorithms and architectures
Communication costs of Strassen's matrix multiplication
Communications of the ACM
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high-performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds and is therefore communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n = 94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
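For context, the sequential recursion that underlies these parallel algorithms can be sketched as follows. This is a minimal illustration of Strassen's classical scheme — seven recursive half-size products instead of eight, giving O(n^log2(7)) ≈ O(n^2.81) arithmetic — and is not the paper's communication-avoiding parallel implementation; the function names and the power-of-two size restriction are simplifying assumptions for this sketch.

```python
def _add(A, B):
    # Elementwise sum of two equal-size matrices (lists of lists).
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def _sub(A, B):
    # Elementwise difference of two equal-size matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B):
    """Multiply two n x n matrices, n a power of two (illustrative sketch)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split each matrix into four h x h quadrants.
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]
    # Strassen's seven recursive products.
    M1 = strassen(_add(A11, A22), _add(B11, B22))
    M2 = strassen(_add(A21, A22), B11)
    M3 = strassen(A11, _sub(B12, B22))
    M4 = strassen(A22, _sub(B21, B11))
    M5 = strassen(_add(A11, A12), B22)
    M6 = strassen(_sub(A21, A11), _add(B11, B12))
    M7 = strassen(_sub(A12, A22), _add(B21, B22))
    # Recombine the products into the four quadrants of C = A * B.
    C11 = _add(_sub(_add(M1, M4), M5), M7)
    C12 = _add(M3, M5)
    C21 = _add(M2, M4)
    C22 = _add(_sub(_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

In the parallel setting, the communication cost comes from distributing these seven subproblems and their operands across processors at each level of the recursion, which is exactly the traffic the paper's algorithm provably minimizes.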