HUNTing the Overlap

Authors:
Costin Iancu;Parry Husbands;Paul Hargrove
Affiliations:
Computational Research Division, Lawrence Berkeley National Laboratory;Computational Research Division, Lawrence Berkeley National Laboratory;Computational Research Division, Lawrence Berkeley National Laboratory
Venue:
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Year:
2005

Citing 18
Cited 14

Memory coherence in shared virtual memory systems

PODC '86 Proceedings of the fifth annual ACM symposium on Principles of distributed computing
LogP: towards a realistic model of parallel computation

PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Global communication analysis and optimization

PLDI '96 Proceedings of the ACM SIGPLAN 1996 conference on Programming language design and implementation
A Unified Framework for Optimizing Communication in Data-Parallel Programs

IEEE Transactions on Parallel and Distributed Systems
LogGP: incorporating long messages into the LogP model for parallel computation

Journal of Parallel and Distributed Computing
Simplification of array access patterns for compiler optimizations

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
MultiView and Millipage — fine-grain sharing in page-based DSMs

OSDI '99 Proceedings of the third symposium on Operating systems design and implementation
Communication optimizations for parallel C programs

Journal of Parallel and Distributed Computing - Special issue on compilation and architectural support for parallel applications
Minimizing Data and Synchronization Costs in One-Way Communication

IEEE Transactions on Parallel and Distributed Systems
Software Support for Virtual Memory-Mapped Communication

IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
ARMCI: A Portable Remote Memory Copy Libray for Ditributed Array Libraries and Compiler Run-Time Systems

Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
A New Approach to Array Redistribution: Strip Mining Redistribution

PARLE '94 Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe
The Design for a High-Performance MPI Implementation on the Myrinet Network

Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
UPC performance and potential: a NPB experimental study

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
GASNet Specification, v1.1

GASNet Specification, v1.1
A Multi-Platform Co-Array Fortran Compiler

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Message strip-mining heuristics for high speed networks

VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science

Shared memory programming for large scale machines

Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation
Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
On using connection-oriented vs. connection-less transport for performance and scalability of collective and one-sided operations: trade-offs and impact

Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic nonblocking communication for partitioned global address space programs

Proceedings of the 21st annual international conference on Supercomputing
Towards Ultra-High Resolution Models of Climate and Weather

International Journal of High Performance Computing Applications
Implementation and performance analysis of non-blocking collective operations for MPI

Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Orchestrating data transfer for the cell/B.E. processor

Proceedings of the 22nd annual international conference on Supercomputing
DBDB: optimizing DMATransfer for the cell be architecture

Proceedings of the 23rd international conference on Supercomputing
Optimizing the use of static buffers for DMA on a CELL chip

LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
An OpenCL framework for heterogeneous multicores with local memory

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
A case for non-blocking collective operations

ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking
Delta Send-Recv for Dynamic Pipelining in MPI Programs

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Automatic communication coalescing for irregular computations in UPC language

CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
Netgauge: a network performance measurement framework

HPCC'07 Proceedings of the Third international conference on High Performance Computing and Communications

Quantified Score

Hi-index	0.00

Visualization

Abstract

Hiding communication latency is an important optimization for parallel programs. Programmers or compilers achieve this by using non-blocking communication primitives and overlapping communication with computation or other communication operations. Using non-blocking communication raises two issues: performance and programmability. In terms of performance, optimizers need to find a good communication schedule and are sometimes constrained by lack of full application knowledge. In terms of programmability, efficiently managing nonblocking communication can prove cumbersome for complex applications. In this paper we present the design principles of HUNT, a runtime system designed to search and exploit some of the available overlap present at execution time in UPC programs. Using virtual memory support, our runtime implements demand-driven synchronization for data involved in communication operations. It also employs message decomposition and scheduling heuristics to transparently improve the non-blocking behavior of applications. We provide a user level implementation of HUNT on a variety of modern high performance computing systems. Results indicate that our approach is successful in finding some of the overlap available at execution time. While system and application characteristics influence performance, perhaps the determining factor is the time taken by the CPU to execute a signal handler. Demand driven synchronization at execution time eliminates the need for the explicit management of non-blocking communication. Besides increasing programmer productivity, this feature also simpli- fies compiler analysis for communication optimizations.