Automatically tuning collective communication for one-sided programming models
Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared-memory programming style combined with the locality control necessary to run on large-scale distributed-memory systems. Even within a PGAS language, programmers often need global communication operations such as broadcasts or reductions, which are best expressed as collective operations in which a group of threads cooperates to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message-passing programs, and they admit implementation approaches that exploit the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared-memory, distributed-memory, and hybrid systems. The framework supports a broad set of algorithms for each collective, over which the implementation can be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, showing that the GASNet collectives deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect.
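To make the interface concrete, the following is a minimal UPC sketch (our own illustration, not code from the paper) of a broadcast using the standard upc_all_broadcast collective from upc_collective.h; in Berkeley UPC these library collectives are layered over the GASNet collectives the abstract describes. The buffer sizes and names are arbitrary.

#include <stdio.h>
#include <upc.h>
#include <upc_collective.h>

#define NELEMS 1024

/* One block of NELEMS ints per thread; thread 0's block is the source. */
shared [NELEMS] int src[NELEMS * THREADS];
shared [NELEMS] int dst[NELEMS * THREADS];

int main(void) {
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;                 /* writes land on thread 0's block */

    /* Broadcast thread 0's block into every thread's block of dst.
       The ALLSYNC flags request full barrier semantics on entry and exit;
       the looser MYSYNC/NOSYNC modes defined by the UPC collectives spec
       permit more communication/computation overlap, which is one of the
       semantic trade-offs discussed above. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    /* Each thread now holds a private copy in its own block of dst. */
    int *mine = (int *)&dst[MYTHREAD * NELEMS];
    printf("thread %d: dst[0..2] = %d %d %d\n",
           MYTHREAD, mine[0], mine[1], mine[2]);
    return 0;
}

Note that the synchronization mode is part of the collective's interface: the caller, not the runtime, decides how much of the barrier semantics it actually needs.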
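The automatic tuning over a set of algorithms can be pictured with the toy C sketch below. It is purely hypothetical and not GASNet code: bcast_flat and bcast_binomial are invented stand-ins for real tree-based collectives, and the "timing" exercises local memory rather than a network. The point is only the dispatch structure: benchmark each candidate per message-size class and remember the winner.

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef void (*bcast_fn)(char *buf, size_t nbytes);

/* Toy stand-ins; a real tuner would time different tree geometries
   and eager vs. rendezvous protocols over the interconnect. */
static void bcast_flat(char *buf, size_t n)     { for (size_t i = 0; i < n; i++) buf[i] ^= 1; }
static void bcast_binomial(char *buf, size_t n) { memset(buf, 0, n); }

static double time_fn(bcast_fn f, char *buf, size_t n, int reps) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) f(buf, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    bcast_fn candidates[] = { bcast_flat, bcast_binomial };
    const char *names[]   = { "flat", "binomial" };
    int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    static char buf[1 << 20];

    /* For each size class, benchmark every candidate and record the
       winner; a real system would cache this table and consult it
       when dispatching the collective. */
    for (size_t n = 8; n <= sizeof buf; n <<= 4) {
        int best = 0;
        double best_t = time_fn(candidates[0], buf, n, 100);
        for (int i = 1; i < ncand; i++) {
            double t = time_fn(candidates[i], buf, n, 100);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("%8zu bytes -> %s\n", n, names[best]);
    }
    return 0;
}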