This paper demonstrates that the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating the bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more, smaller messages, yet the one-sided versions achieve a 1.9× speedup over the base Fortran/MPI implementation. Our one-sided versions show an average 15% improvement over the two-sided versions, due to the lower software overhead of one-sided communication, whose semantics are fundamentally lighter-weight than message passing. Our UPC results use Berkeley UPC with GASNet and demonstrate the scalability of that system, with performance approaching 0.5 TFlop/s on the FT benchmark on 512 processors.