This paper demonstrates that the one-sided communication used in languages like UPC can provide a significant performance advantage for bandwidth-limited applications. This is shown through communication microbenchmarks and a case study of UPC and MPI implementations of the NAS FT benchmark. Our optimizations rely on aggressively overlapping communication with computation, alleviating the bottlenecks that typically occur when communication is isolated in a single phase. The new algorithms send more, smaller messages, yet the one-sided versions achieve a 1.9× speedup over the base Fortran/MPI implementation. Our one-sided versions show an average 15% improvement over the two-sided versions, due to the lower software overhead of one-sided communication, whose semantics are fundamentally lighter-weight than message passing. Our UPC results use Berkeley UPC with GASNet and demonstrate the scalability of that system, with performance approaching 0.5 TFlop/s on the FT benchmark on 512 processors.