Improving communication in PGAS environments: static and dynamic coalescing in UPC

Authors:
Michail Alvanos;Montse Farreras;Ettore Tiotto;José Nelson Amaral;Xavier Martorell
Affiliations:
Barcelona Supercomputing Center, Barcelona, Spain;Universitat Politècnica de Catalunya, Barcelona, Spain;IBM Toronto Laboratory, Toronto, Canada;University of Alberta, Edmonton, Canada;Universitat Politècnica de Catalunya, Barcelona, Spain
Venue:
Proceedings of the 27th ACM International Conference on Supercomputing
Year:
2013

Abstract

The goal of Partitioned Global Address Space (PGAS) languages is to improve programmer productivity on large-scale parallel machines. However, PGAS programs may contain many fine-grained shared accesses that degrade performance. Improving the performance of such programs requires either manual code transformations or compiler optimizations. Manual transformations increase program complexity, which hinders programmer productivity. Most compiler optimizations of fine-grained accesses, on the other hand, require knowledge of the physical data mapping and the use of parallel loop constructs. This paper presents an optimization for the Unified Parallel C (UPC) language that combines compile-time (static) and runtime (dynamic) coalescing of shared data, without requiring knowledge of the physical data mapping. Larger messages increase network efficiency, and static coalescing decreases the overhead of library calls. The performance evaluation uses two microbenchmarks and three benchmarks to obtain scaling and absolute performance numbers on up to 32768 cores of a Power 775 machine. Our results show that the compiler transformation yields speedups from 1.15X up to 21X compared with the baseline versions and that the transformed programs achieve up to 63% of the performance of the MPI versions.
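To make the idea of coalescing concrete, the following is a minimal sketch in plain C, not the paper's implementation and not UPC code. It only counts the messages a thread would send when reading a set of shared-array elements, first one element at a time and then with all elements owned by the same remote thread grouped into one message. The block-cyclic layout, the function names (`owner_of`, `messages_fine_grained`, `messages_coalesced`), and the `MAX_THREADS` limit are assumptions introduced purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 64  /* illustrative upper bound, not from the paper */

/* Assumed layout: element i lives on thread (i / block) % threads,
   similar in spirit to a block-cyclic shared array. */
int owner_of(size_t index, size_t block, int threads) {
    assert(block > 0 && threads > 0);
    return (int)((index / block) % (size_t)threads);
}

/* Fine-grained access: every remote element costs one message. */
size_t messages_fine_grained(const size_t *idx, size_t n,
                             size_t block, int threads, int me) {
    size_t msgs = 0;
    for (size_t k = 0; k < n; k++) {
        if (owner_of(idx[k], block, threads) != me) {
            msgs++;
        }
    }
    return msgs;
}

/* Coalesced access: all elements owned by the same remote thread
   travel together, so the cost is one message per distinct owner. */
size_t messages_coalesced(const size_t *idx, size_t n,
                          size_t block, int threads, int me) {
    assert(threads > 0 && threads <= MAX_THREADS);
    int touched[MAX_THREADS] = {0};
    size_t msgs = 0;
    for (size_t k = 0; k < n; k++) {
        int o = owner_of(idx[k], block, threads);
        if (o != me && !touched[o]) {
            touched[o] = 1;  /* first element seen for this owner */
            msgs++;
        }
    }
    return msgs;
}
```

For example, with a block size of 4, four threads, and thread 0 reading indices 0, 1, 4, 5, 8, 9, 12, 13, six of the eight elements are remote. The fine-grained count is six messages, while the coalesced count is three, one per remote owner. The sketch ignores message size limits and the distinction between static and dynamic coalescing that the abstract describes.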