Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

Authors:
Edgar Solomonik; James Demmel
Affiliations:
Department of Computer Science, University of California at Berkeley, Berkeley, CA (both authors)
Venue:
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Year:
2011

Abstract

Extra memory allows parallel matrix multiplication to be done with asymptotically less communication than Cannon's algorithm and be faster in practice. "3D" algorithms arrange the $p$ processors in a 3D array, and store redundant copies of the matrices on each of $p^{1/3}$ layers. "2D" algorithms such as Cannon's algorithm store a single copy of the matrices on a 2D array of processors. We generalize these 2D and 3D algorithms by introducing a new class of "2.5D algorithms". For matrix multiplication, we can take advantage of any amount of extra memory to store $c$ copies of the data, for any $c \in \{1, 2, \ldots, \lfloor p^{1/3} \rfloor\}$, to reduce the bandwidth cost of Cannon's algorithm by a factor of $c^{1/2}$ and the latency cost by a factor of $c^{3/2}$. We also show that these costs reach the lower bounds, modulo polylog($p$) factors. We introduce a novel algorithm for 2.5D LU decomposition. To the best of our knowledge, this LU algorithm is the first to minimize communication along the critical path of execution in the 3D case. Our 2.5D LU algorithm uses communication-avoiding pivoting, a stable alternative to partial pivoting. We prove a novel lower bound on the latency cost of 2.5D and 3D LU factorization, showing that while $c$ copies of the data can also reduce the bandwidth by a factor of $c^{1/2}$, the latency must increase by a factor of $c^{1/2}$, so that the 2D LU algorithm ($c = 1$) in fact minimizes latency. We provide implementations and performance results for 2D and 2.5D versions of all the new algorithms. Our results demonstrate that 2.5D matrix multiplication and LU algorithms strongly scale more efficiently than 2D algorithms. Each of our 2.5D algorithms performs over 2X faster than the corresponding 2D algorithm for certain problem sizes on 65,536 cores of a BG/P supercomputer.
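
The scaling factors stated in the abstract can be made concrete with a small cost model. The sketch below is illustrative only and is not taken from the paper's implementation. It assumes the usual 2D baselines of roughly n^2/sqrt(p) words and sqrt(p) messages per processor, with constants and polylog(p) factors dropped, and then applies the abstract's factors of c^{1/2} and c^{3/2}. The function names (max_replication, matmul_costs, lu_costs) are hypothetical.

```python
import math


def max_replication(p: int) -> int:
    """Largest integer c with c**3 <= p, i.e. floor(p^(1/3))."""
    c = round(p ** (1.0 / 3.0))
    # Correct any floating-point rounding with exact integer checks.
    while c ** 3 > p:
        c -= 1
    while (c + 1) ** 3 <= p:
        c += 1
    return c


def _check(p: int, c: int) -> None:
    if not 1 <= c <= max_replication(p):
        raise ValueError("c must lie in {1, ..., floor(p^(1/3))}")


def matmul_costs(n: int, p: int, c: int = 1) -> tuple[float, float]:
    """Modelled (words, messages) per processor for 2.5D matrix multiplication.

    Assumed 2D baseline (c = 1): n^2/sqrt(p) words, sqrt(p) messages.
    Per the abstract, c copies divide bandwidth by c^(1/2), latency by c^(3/2).
    """
    _check(p, c)
    words_2d = n * n / math.sqrt(p)
    messages_2d = math.sqrt(p)
    return words_2d / math.sqrt(c), messages_2d / c ** 1.5


def lu_costs(n: int, p: int, c: int = 1) -> tuple[float, float]:
    """Modelled (words, messages) per processor for 2.5D LU factorization.

    Same assumed 2D baseline. Per the abstract, c copies divide bandwidth
    by c^(1/2) but multiply latency by c^(1/2).
    """
    _check(p, c)
    words_2d = n * n / math.sqrt(p)
    messages_2d = math.sqrt(p)
    return words_2d / math.sqrt(c), messages_2d * math.sqrt(c)


if __name__ == "__main__":
    p = 65536  # core count mentioned in the abstract
    print("largest usable c:", max_replication(p))
    for c in (1, 4, 16):
        print(c, matmul_costs(32768, p, c), lu_costs(32768, p, c))
```

In this model, raising c always helps matrix multiplication on both measures, whereas for LU it trades lower bandwidth for higher latency, which is why c = 1 minimizes LU latency.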