Performance without pain = productivity: data layout and collective communication in UPC

Authors:
Rajesh Nishtala;George Almasi;Calin Cascaval
Affiliations:
UC Berkeley, Berkeley, CA, USA;IBM T.J. Watson Research Center, Yorktown Heights, NY, USA;IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
Venue:
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Year:
2008

Citing 21
Cited 10

LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
Co-array Fortran for parallel programming

ACM SIGPLAN Fortran Forum
ZPL: A Machine Independent Programming Language for Parallel Computers

IEEE Transactions on Software Engineering - Special issue on architecture-independent languages and software tools for parallel processing
A high performance parallel algorithm for 1-D FFT

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
A Proposal for a Set of Parallel Basic Linear Algebra Subprograms

PARA '95 Proceedings of the Second International Workshop on Applied Parallel Computing, Computations in Physics, Chemistry and Engineering Science
UPC performance and potential: a NPB experimental study

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
A performance analysis of the Berkeley UPC compiler

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Titanium Language Reference Manual

Titanium Language Reference Manual
SUMMA: Scalable Universal Matrix Multiplication Algorithm

SUMMA: Scalable Universal Matrix Multiplication Algorithm
A cellular computer to implement the kalman filter algorithm

A cellular computer to implement the kalman filter algorithm
An evaluation of global address space languages: co-array fortran and unified parallel C

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Optimization of MPI collective communication on BlueGene/L systems

Proceedings of the 19th annual international conference on Supercomputing
Shared memory programming for large scale machines

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Overview of the Blue Gene/L system architecture

IBM Journal of Research and Development
Design and implementation of message-passing services for the Blue Gene/L supercomputer

IBM Journal of Research and Development
Compilation techniques for partitioned global address space languages

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Efficient RDMA-based multi-port collectives on multi-rail QsNetII clusters

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Optimizing bandwidth limited problems using one-sided communication and overlap

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Titanium performance and potential: an NPB experimental study

LCPC'05 Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing
Co-Array collectives: refined semantics for co-array fortran

ICCS'06 Proceedings of the 6th international conference on Computational Science - Volume Part II

Architecture of the Component Collective Messaging Interface

Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
A Parallel Numerical Library for UPC

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
UPC performance evaluation on a multicore system

Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
Evaluation of UPC programmability using classroom studies

Proceedings of the Third Conference on Partitioned Global Address Space Programing Models
Optimizing collective communication on multicores

HotPar'09 Proceedings of the First USENIX conference on Hot topics in parallelism
Unified parallel C for GPU clusters: language extensions and compiler implementation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Hybrid PGAS runtime support for multicore nodes

Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Communication-optimal Parallel and Sequential QR and LU Factorizations

SIAM Journal on Scientific Computing
Aspen: a domain specific language for performance modeling

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic communication coalescing for irregular computations in UPC language

CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research

Quantified Score

Hi-index	0.00

Visualization

Abstract

The next generations of supercomputers are projected to have hundreds of thousands of processors. However, as the numbers of processors grow, the scalability of applications will be the dominant challenge. This forces us to reexamine some of our fundamental ways that we approach the design and use of parallel languages and runtime systems. In this paper we show how the globally shared arrays in a popular Partitioned Global Address Space (PGAS) language, Unified Parallel C (UPC), can be combined with a new collective interface to improve both performance and scalability. This interface allows subsets, or teams, of threads to perform a collective together. As opposed to MPI's communicators, our interface allows set of threads to be placed in teams instantly rather than explicitly constructing communicators, thus allowing for a more dynamic team construction and manipulation. We motivate our ideas with three application kernels: Dense Matrix Multiplication, Dense Cholesky factorization and multidimensional Fourier transforms. We describe how the three aforementioned applications can be succinctly written in UPC thereby aiding productivity. We also show how such an interface allows for scalability by running on up to 16,384 processors on the Blue-Gene/L. In a few lines of UPC code, we wrote a dense matrix multiply routine achieves 28.8 TFlop/s and a 3D FFT that achieves 2.1 TFlop/s. We analyze our performance results through models and show that the machine resources rather than the interfaces themselves limit the performance.