Communication complexity of PRAMs
Theoretical Computer Science - Special issue: Fifteenth international colloquium on automata, languages and programming, Tampere, Finland, July 1988
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
A three-dimensional approach to parallel matrix multiplication
IBM Journal of Research and Development
LogGP: incorporating long messages into the LogP model for parallel computation
Journal of Parallel and Distributed Computing
An Evaluation of Current High-Performance Networks
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
A cellular computer to implement the kalman filter algorithm
A cellular computer to implement the kalman filter algorithm
Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Optimizing bandwidth limited problems using one-sided communication and overlap
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Proceedings of the second international workshop on Performance modeling, benchmarking and simulation of high performance computing systems
Elemental: A New Framework for Distributed Memory Dense Matrix Computations
ACM Transactions on Mathematical Software (TOMS)
The Servet 3.0 benchmark suite: Characterization of network performance degradation
Computers and Electrical Engineering
Hi-index | 0.00 |
To efficiently scale dense linear algebra problems to future exascale systems, communication cost must be avoided or overlapped. Communication-avoiding 2.5D algorithms improve scalability by reducing inter-processor data transfer volume at the cost of extra memory usage. Communication overlap attempts to hide messaging latency by pipelining messages and overlapping with computational work. We study the interaction and compatibility of these two techniques for two matrix multiplication algorithms (Cannon and SUMMA), triangular solve, and Cholesky factorization. For each algorithm, we construct a detailed performance model that considers both critical path dependencies and idle time. We give novel implementations of 2.5D algorithms with overlap for each of these problems. Our software employs UPC, a partitioned global address space (PGAS) language that provides fast one-sided communication. We show communication avoidance and overlap provide a cumulative benefit as core counts scale, including results using over 24K cores of a Cray XE6 system.