Improving communication performance in dense linear algebra via topology aware collectives

Authors:
Edgar Solomonik; Abhinav Bhatele; James Demmel
Affiliations:
University of California at Berkeley, Berkeley, CA; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA; University of California at Berkeley, Berkeley, CA
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Recent results have shown that topology-aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that, on mesh interconnects, topology-aware mapping also enables the use of highly efficient topology-aware collectives. We map novel 2.5D dense linear algebra algorithms onto the cuboid partitions allocated by a Blue Gene/P supercomputer so that they can exploit rectangular collectives, in particular optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are up to 8.7x faster than 2D matrix multiplication (MM) and up to 2.1x faster than 2D LU factorization. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using these models, we predict the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
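
The abstract does not state the cost formulas behind the communication reduction, so the following Python sketch is only an illustration under stated assumptions. It assumes the leading-order scaling usually quoted for 2.5D algorithms in the literature: each processor moves on the order of n^2 / sqrt(c * p) words, where c is the number of replicated data layers and c = 1 recovers the 2D case. The function names, the choice c = 16, and the matrix dimension n are hypothetical and are not taken from the paper.

```python
import math


def grid_2_5d(p, c):
    """Return a (q, q, c) processor grid for p processors and c layers.

    Assumption: p / c must be a perfect square so each layer is q x q.
    """
    if c < 1 or p % c != 0:
        raise ValueError("c must be a positive divisor of p")
    q = math.isqrt(p // c)
    if q * q * c != p:
        raise ValueError("p / c must be a perfect square")
    return (q, q, c)


def words_per_processor(n, p, c=1):
    """Leading-order words moved per processor, constants dropped.

    Assumed model: n^2 / sqrt(c * p). With c = 1 this is the 2D term.
    """
    return n * n / math.sqrt(c * p)


def modeled_reduction(c):
    """Fractional drop in the bandwidth term of 2.5D relative to 2D."""
    return 1.0 - 1.0 / math.sqrt(c)


if __name__ == "__main__":
    p = 16384   # node count from the abstract
    c = 16      # hypothetical replication factor
    n = 65536   # hypothetical matrix dimension
    print("grid:", grid_2_5d(p, c))
    print("2D words per processor:  ", words_per_processor(n, p))
    print("2.5D words per processor:", words_per_processor(n, p, c))
    print("modeled reduction:", modeled_reduction(c))
```

This idealized model covers only the asymptotic bandwidth term. It is not meant to reproduce the measured figures in the abstract, which also reflect the effect of rectangular collectives on the actual network.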