Communication avoiding and overlapping for numerical linear algebra

Authors:
Evangelos Georganas;Jorge González-Domínguez;Edgar Solomonik;Yili Zheng;Juan Touriño;Katherine Yelick
Affiliations:
University of California at Berkeley, Berkeley, CA;University of A Coruña, A Coruña, Spain;University of California at Berkeley, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;University of A Coruña, A Coruña, Spain;University of California at Berkeley, Berkeley, CA and Lawrence Berkeley National Laboratory, Berkeley, CA
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping them with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show that communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.
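The structure of a 2.5D matrix multiplication can be illustrated with a small single-process simulation. The sketch below is not the authors' UPC implementation: it is plain Python, performs no real communication, and does not model overlap. The names `summa_25d`, `split`, `zeros`, `q` (grid side), and `c` (number of replicated layers) are hypothetical and chosen for illustration. Each of the `c` layers holds its own copy of the inputs (the extra memory), performs only `q / c` of the `q` SUMMA broadcast steps, and the partial results are then summed across layers. For simplicity the grid side `q` is held fixed while `c` varies, so the sketch shows the algorithmic structure rather than the cost at a fixed processor count.

```python
def matmul(X, Y):
    """Plain triple-loop product of two dense matrices stored as lists of rows."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def zeros(b):
    """Return a b x b block of zeros."""
    return [[0] * b for _ in range(b)]


def split(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def summa_25d(A, B, q, c):
    """Simulate SUMMA on a q x q x c virtual grid.

    Returns (C, steps_per_layer). With c == 1 this is ordinary 2D SUMMA.
    """
    n = len(A)
    assert n % q == 0 and q % c == 0, "assumes q divides n and c divides q"
    b = n // q
    Ab, Bb = split(A, q), split(B, q)
    steps = q // c  # broadcast steps each layer performs

    # partial[layer][i][j] is the C block accumulated by virtual processor (i, j, layer).
    partial = [[[zeros(b) for _ in range(q)] for _ in range(q)] for _ in range(c)]

    for layer in range(c):
        # Each layer handles a disjoint slice of the k loop.
        for k in range(layer * steps, (layer + 1) * steps):
            for i in range(q):
                for j in range(q):
                    # In a real run, Ab[i][k] would be broadcast along grid row i
                    # and Bb[k][j] along grid column j; here we just read them.
                    prod = matmul(Ab[i][k], Bb[k][j])
                    blk = partial[layer][i][j]
                    for r in range(b):
                        for s in range(b):
                            blk[r][s] += prod[r][s]

    # Reduction across layers assembles the final result.
    C = [[0] * n for _ in range(n)]
    for layer in range(c):
        for i in range(q):
            for j in range(q):
                blk = partial[layer][i][j]
                for r in range(b):
                    for s in range(b):
                        C[i * b + r][j * b + s] += blk[r][s]
    return C, steps


if __name__ == "__main__":
    n = 8
    A = [[i * n + j for j in range(n)] for i in range(n)]
    B = [[(i + 2 * j) % 5 for j in range(n)] for i in range(n)]
    for layers in (1, 2, 4):
        C, steps = summa_25d(A, B, q=4, c=layers)
        print(layers, steps, C == matmul(A, B))
```

In this sketch the number of broadcast steps per layer falls from `q` to `q / c` as layers are added, at the price of `c` copies of the inputs and one final reduction; an implementation with overlap would additionally issue the transfers for step `k + 1` while computing step `k`.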