Automatically tuning collective communication for one-sided programming models
Partitioned Global Address Space (PGAS) languages offer programmers the convenience of a shared-memory programming style combined with the locality control necessary to run on large-scale distributed-memory systems. Even within a PGAS language, programmers often need global communication operations such as broadcasts or reductions, which are best expressed as collective operations in which a group of threads cooperates to perform the operation. In this paper we consider the problem of implementing collective communication within PGAS languages and explore some of the design trade-offs in both the interface and the implementation. In particular, PGAS collectives raise semantic issues that differ from those in send-receive style message-passing programs, and they admit implementation approaches that exploit the one-sided communication style of these languages. We present an implementation framework for PGAS collectives as part of the GASNet communication layer, which supports shared-memory, distributed-memory, and hybrid systems. The framework supports a broad set of algorithms for each collective, over which the implementation can be automatically tuned. Finally, we demonstrate the benefit of optimized GASNet collectives using application benchmarks written in UPC, showing that the GASNet collectives deliver scalable performance on a variety of state-of-the-art parallel machines, including a Cray XT4, an IBM BlueGene/P, and a Sun Constellation system with an InfiniBand interconnect.
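To make the interface concrete, the following is a minimal UPC sketch (our own illustration, not code from the paper) of a broadcast using the standard upc_all_broadcast collective from upc_collective.h; in Berkeley UPC these library collectives are layered over the GASNet collectives the abstract describes. The buffer sizes and names are arbitrary.

#include <stdio.h>
#include <upc.h>
#include <upc_collective.h>

#define NELEMS 1024

/* One block of NELEMS ints per thread; thread 0's block is the source. */
shared [NELEMS] int src[NELEMS * THREADS];
shared [NELEMS] int dst[NELEMS * THREADS];

int main(void) {
    if (MYTHREAD == 0)
        for (int i = 0; i < NELEMS; i++)
            src[i] = i;                 /* writes land on thread 0's block */

    /* Broadcast thread 0's block into every thread's block of dst.
       The ALLSYNC flags request full barrier semantics on entry and exit;
       the looser MYSYNC/NOSYNC modes defined by the UPC collectives spec
       permit more communication/computation overlap, which is one of the
       semantic trade-offs discussed above. */
    upc_all_broadcast(dst, src, NELEMS * sizeof(int),
                      UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

    /* Each thread now holds a private copy in its own block of dst. */
    int *mine = (int *)&dst[MYTHREAD * NELEMS];
    printf("thread %d: dst[0..2] = %d %d %d\n",
           MYTHREAD, mine[0], mine[1], mine[2]);
    return 0;
}

Note that the synchronization mode is part of the collective's interface: the caller, not the runtime, decides how much of the barrier semantics it actually needs.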
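The automatic tuning over a set of algorithms can be pictured with the toy C sketch below. It is purely hypothetical and not GASNet code: bcast_flat and bcast_binomial are invented stand-ins for real tree-based collectives, and the "timing" exercises local memory rather than a network. The point is only the dispatch structure: benchmark each candidate per message-size class and remember the winner.

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef void (*bcast_fn)(char *buf, size_t nbytes);

/* Toy stand-ins; a real tuner would time different tree geometries
   and eager vs. rendezvous protocols over the interconnect. */
static void bcast_flat(char *buf, size_t n)     { for (size_t i = 0; i < n; i++) buf[i] ^= 1; }
static void bcast_binomial(char *buf, size_t n) { memset(buf, 0, n); }

static double time_fn(bcast_fn f, char *buf, size_t n, int reps) {
    clock_t t0 = clock();
    for (int r = 0; r < reps; r++) f(buf, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void) {
    bcast_fn candidates[] = { bcast_flat, bcast_binomial };
    const char *names[]   = { "flat", "binomial" };
    int ncand = (int)(sizeof candidates / sizeof candidates[0]);
    static char buf[1 << 20];

    /* For each size class, benchmark every candidate and record the
       winner; a real system would cache this table and consult it
       when dispatching the collective. */
    for (size_t n = 8; n <= sizeof buf; n <<= 4) {
        int best = 0;
        double best_t = time_fn(candidates[0], buf, n, 100);
        for (int i = 1; i < ncand; i++) {
            double t = time_fn(candidates[i], buf, n, 100);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("%8zu bytes -> %s\n", n, names[best]);
    }
    return 0;
}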