Overlapping communication with computation is an important optimization on current cluster architectures; its importance is likely to increase as the doubling of processing power far outpaces any improvements in communication latency. PGAS languages offer unique opportunities for communication overlap, because their one-sided communication model enables low-overhead data transfer. Recent results have shown the value of hiding latency by manually applying language-level nonblocking data transfer routines, but this process can be both tedious and error-prone. In this paper, we present a runtime framework that automatically schedules data transfers to achieve overlap. The optimization framework is entirely transparent to the user, and it aggressively reorders and aggregates both remote puts and gets. We preserve correctness via runtime conflict checks and temporary buffers, using several techniques to lower the overhead. Experimental results on application benchmarks suggest that our framework can be very effective at hiding communication latency on clusters, improving performance over the blocking code by an average of 16% for some of the NAS Parallel Benchmarks, 48% for GUPS, and over 25% for a multi-block fluid dynamics solver. While the system is not yet as effective as aggressive manual optimization, it increases programmers' productivity by freeing them from the details of communication management.
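To make the idea of deferred, aggregated transfers guarded by runtime conflict checks more concrete, the following is a minimal toy sketch in Python. It is not the authors' framework and does not model real network asynchrony: the class `ToyRuntime`, its methods `put`, `get`, and `sync`, and the `messages_sent` counter are all hypothetical names introduced here for illustration. Remote memory is simulated as a dictionary of per-node lists, and the policy of flushing all pending puts when a get overlaps a queued write is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRuntime:
    """Toy model of deferred one-sided puts (hypothetical, for illustration only)."""
    remote: dict                                  # node id -> list of ints (simulated remote memory)
    pending: list = field(default_factory=list)   # queued puts: (node, offset, values)
    messages_sent: int = 0                        # number of simulated network messages

    def put(self, node, offset, values):
        # Copy the source into a temporary buffer so the caller may reuse it
        # before the transfer is actually issued.
        self.pending.append((node, offset, list(values)))

    def _overlaps(self, node, offset, length):
        # True if the range [offset, offset + length) on `node` intersects a queued put.
        return any(n == node and o < offset + length and offset < o + len(v)
                   for n, o, v in self.pending)

    def get(self, node, offset, length):
        # Runtime conflict check: a read that overlaps a queued write
        # forces the pending puts to complete first.
        if self._overlaps(node, offset, length):
            self.sync()
        self.messages_sent += 1
        return self.remote[node][offset:offset + length]

    def sync(self):
        # Aggregate queued puts into one simulated message per destination node,
        # applying writes to each node in their original program order.
        by_node = {}
        for node, offset, values in self.pending:
            by_node.setdefault(node, []).append((offset, values))
        for node, writes in by_node.items():
            for offset, values in writes:
                self.remote[node][offset:offset + len(values)] = values
            self.messages_sent += 1
        self.pending.clear()
```

In this sketch, three puts to two nodes cost two simulated messages at `sync()` rather than three, and a `get` that touches a queued range still observes the written values because the conflict check flushes first. A real runtime would additionally issue the transfers asynchronously so that they overlap with computation, which this toy omits.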