We present lower bounds on the amount of communication that matrix multiplication algorithms must perform on a distributed-memory parallel computer. We denote the number of processors by P and the dimension of the square matrices by n. We show that the most widely used class of algorithms, the so-called two-dimensional (2D) algorithms, is optimal, in the sense that in any algorithm that uses only O(n^2/P) words of memory per processor, at least one processor must send or receive Ω(n^2/P^{1/2}) words. We also show that algorithms from another class, the so-called three-dimensional (3D) algorithms, are optimal as well. These algorithms use replication to reduce communication. We show that in any algorithm that uses O(n^2/P^{2/3}) words of memory per processor, at least one processor must send or receive Ω(n^2/P^{2/3}) words. Furthermore, we show a continuous tradeoff between the size of the local memories and the amount of communication that must be performed; the 2D and 3D bounds are essentially instantiations of this tradeoff. We also show that if the input is distributed across the local memories of multiple nodes without replication, then Ω(n^2) words must cross any bisection cut of the machine. All our bounds apply only to conventional Θ(n^3) algorithms; they do not apply to Strassen's algorithm or other o(n^3) algorithms.
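To illustrate the memory-communication tradeoff the abstract describes, the sketch below evaluates a bound of the asymptotic form Ω(n^3 / (P · √M)), where M is the local memory per processor; the exact constant factors are omitted here and this specific functional form is an assumption consistent with the abstract, not a quotation of the paper's lemma. Plugging in M = n^2/P recovers the 2D bound Ω(n^2/P^{1/2}), and M = n^2/P^{2/3} recovers the 3D bound Ω(n^2/P^{2/3}), showing both as instantiations of the same tradeoff.

```python
import math

def comm_lower_bound(n, p, mem):
    """Asymptotic communication lower bound per processor (constants dropped):
    with mem words of local memory, some processor must send or receive on
    the order of n**3 / (p * sqrt(mem)) words. Illustrative sketch only."""
    return n**3 / (p * math.sqrt(mem))

n, p = 4096, 64

# 2D algorithms: M = n^2/P local memory -> bound simplifies to n^2 / sqrt(P)
two_d = comm_lower_bound(n, p, n**2 / p)

# 3D algorithms: M = n^2/P^(2/3) local memory -> bound simplifies to n^2 / P^(2/3)
three_d = comm_lower_bound(n, p, n**2 / p**(2 / 3))

print(f"2D bound ~ {two_d:.0f} words  (n^2/sqrt(P) = {n**2 / math.sqrt(p):.0f})")
print(f"3D bound ~ {three_d:.0f} words (n^2/P^(2/3) = {n**2 / p**(2 / 3):.0f})")
```

Note how replication (larger M in the 3D case) lowers the bound: the tradeoff is continuous in M, and the two classical algorithm classes simply sit at two particular points on that curve.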