Effective use of communication networks is critical to the performance and scalability of parallel applications. Partitioned Global Address Space (PGAS) languages like UPC promise both performance and programmer productivity. Studies of well-tuned programs suggest that PGAS languages use modern networks effectively because their one-sided communication is a good match to the underlying network hardware. An open question is whether the manual optimizations required to achieve good performance can be performed automatically by the compiler in a performance-portable manner. In this paper we present a compiler and runtime optimization framework for loops containing communication operations. Our framework performs compile-time message vectorization and strip-mining and defers the selection of the actual communication operations until runtime. At runtime, the communication requirements of the program are analyzed, and communication is instantiated and scheduled based on highly tuned network and application performance models. The runtime analysis takes network flow control and quality-of-service restrictions into account, and it selects, from a large class of available communication primitives, the communication schedule best suited to the dynamic combination of input size and system parameters. The results indicate that our framework produces code that scales and performs better than manually optimized implementations. Our approach not only improves performance but also increases programmer productivity.
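As a rough illustration of the loop transformation the abstract describes, the sketch below strip-mines a remote copy in UPC and replaces fine-grained element reads with one bulk one-sided upc_memget per strip (message vectorization), while a placeholder routine stands in for the runtime's model-driven choice of strip size. The array layout, the runtime_select_block helper, and the surrounding names are illustrative assumptions, not the paper's actual framework or API.

    #include <upc.h>
    #include <stdlib.h>

    #define N 65536                        /* elements owned by each thread */

    /* Block the shared array so each thread owns N contiguous elements. */
    shared [N] double src[N * THREADS];
    double dst[N];                          /* private result on each thread */

    /* Hypothetical stand-in for the runtime's model-driven selection of a
     * strip (message) size; the real framework would consult tuned network
     * and application performance models here. */
    static size_t runtime_select_block(size_t total) {
        (void)total;
        return 4096;
    }

    /* Strip-mined, vectorized fetch of the neighboring thread's block. */
    void vectorized_copy(void) {
        size_t B = runtime_select_block(N);
        size_t peer = (MYTHREAD + 1) % THREADS;
        double *buf = malloc(B * sizeof(double));

        for (size_t lo = 0; lo < N; lo += B) {
            size_t len = (lo + B < N) ? B : N - lo;
            /* One bulk one-sided get replaces 'len' fine-grained remote reads. */
            upc_memget(buf, &src[peer * N + lo], len * sizeof(double));
            for (size_t i = 0; i < len; i++)
                dst[lo + i] = 2.0 * buf[i]; /* stand-in computation */
        }
        free(buf);
    }

In a full implementation, the compiler would emit the strip-mined loop shape at compile time, and the runtime would replace the fixed strip size above with a value chosen from its performance models, possibly also switching among blocking, non-blocking, and scatter/gather communication primitives.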