Run-Time Parallelization and Scheduling of Loops
IEEE Transactions on Computers
Improving the performance of runtime parallelization
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Journal of Parallel and Distributed Computing - Special issue on scalability of parallel algorithms and architectures
A Unified Framework for Optimizing Communication in Data-Parallel Programs
IEEE Transactions on Parallel and Distributed Systems
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
GASNet Specification, v1.1
A Multi-Platform Co-Array Fortran Compiler
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Automatic Support for Irregular Computations in a High-Level Language
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Effective communication coalescing for data-parallel applications
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Communication Optimizations for Fine-Grained UPC Applications
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
X10: an object-oriented approach to non-uniform cluster computing
OOPSLA '05 Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
A UPC Runtime System Based on MPI and POSIX Threads
PDP '06 Proceedings of the 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Shared memory programming for large scale machines
Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
The HPC Challenge (HPCC) benchmark suite
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Automatic nonblocking communication for partitioned global address space programs
Proceedings of the 21st annual international conference on Supercomputing
Productivity and performance using partitioned global address space languages
Proceedings of the 2007 international workshop on Parallel symbolic computation
Performance without pain = productivity: data layout and collective communication in UPC
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Design and implementation of message-passing services for the Blue Gene/L supercomputer
IBM Journal of Research and Development
A characterization of shared data access patterns in UPC programs
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
A practical study of UPC using the NAS Parallel Benchmarks
Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
The PERCS High-Performance Interconnect
HOTI '10 Proceedings of the 2010 18th IEEE Symposium on High Performance Interconnects
Asynchronous PGAS runtime for Myrinet networks
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Improving communication in PGAS environments: static and dynamic coalescing in UPC
Proceedings of the 27th international ACM conference on International conference on supercomputing
Hi-index | 0.00 |
Partitioned Global Address Space (PGAS) languages appeared to address programmer productivity in large scale parallel machines. However, fine grain accesses on shared structures have been identified as one of the main bottlenecks of PGAS languages. Manual or compiler assistance code optimization is required to avoid fine grain accesses. The downside of manually applying code transformations is the increased program complexity and hindering of the programmer productivity. On the other hand, compiler optimizations of fine grain accesses require knowledge of physical data mapping and the use of parallel loop constructs. This paper presents an optimization for prefetching and coalescing of shared accesses at runtime. Larger messages decrease the impact of remote access latency and increase the efficiency of the network communication. We have implemented our optimization for the Unified Parallel C (UPC) language. An experimental evaluation on a distributed-memory environment using a Power7 cluster demonstrates the benefits of our optimization.