Effective use of communication networks is critical to the performance and scalability of parallel applications. Partitioned Global Address Space (PGAS) languages like UPC promise both performance and programmer productivity. Studies of well-tuned programs suggest that PGAS languages use modern networks effectively because their one-sided communication is a good match to the underlying network hardware. An open question is whether the manual optimizations required to achieve good performance can be performed automatically by the compiler in a performance-portable manner. In this paper we present a compiler and runtime optimization framework for loops containing communication operations. Our framework performs compile-time message vectorization and strip-mining and defers the selection of the actual communication operations until runtime. At runtime, the communication requirements of the program are analyzed, and communication is instantiated and scheduled based on highly tuned network and application performance models. The runtime analysis takes network flow control and quality-of-service restrictions into account, and it selects, from a large class of available communication primitives, the communication schedule best suited to the dynamic combination of input size and system parameters. The results indicate that our framework produces code that scales and performs better than manually optimized implementations. Our approach not only improves performance but also increases programmer productivity.
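As a rough illustration of the loop transformation the abstract describes, the sketch below strip-mines a remote copy in UPC and replaces fine-grained element reads with one bulk one-sided upc_memget per strip (message vectorization), while a placeholder routine stands in for the runtime's model-driven choice of strip size. The array layout, the runtime_select_block helper, and the surrounding names are illustrative assumptions, not the paper's actual framework or API.

    #include <upc.h>
    #include <stdlib.h>

    #define N 65536                        /* elements owned by each thread */

    /* Block the shared array so each thread owns N contiguous elements. */
    shared [N] double src[N * THREADS];
    double dst[N];                          /* private result on each thread */

    /* Hypothetical stand-in for the runtime's model-driven selection of a
     * strip (message) size; the real framework would consult tuned network
     * and application performance models here. */
    static size_t runtime_select_block(size_t total) {
        (void)total;
        return 4096;
    }

    /* Strip-mined, vectorized fetch of the neighboring thread's block. */
    void vectorized_copy(void) {
        size_t B = runtime_select_block(N);
        size_t peer = (MYTHREAD + 1) % THREADS;
        double *buf = malloc(B * sizeof(double));

        for (size_t lo = 0; lo < N; lo += B) {
            size_t len = (lo + B < N) ? B : N - lo;
            /* One bulk one-sided get replaces 'len' fine-grained remote reads. */
            upc_memget(buf, &src[peer * N + lo], len * sizeof(double));
            for (size_t i = 0; i < len; i++)
                dst[lo + i] = 2.0 * buf[i]; /* stand-in computation */
        }
        free(buf);
    }

In a full implementation, the compiler would emit the strip-mined loop shape at compile time, and the runtime would replace the fixed strip size above with a value chosen from its performance models, possibly also switching among blocking, non-blocking, and scatter/gather communication primitives.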