Quantifying the potential benefit of overlapping communication and computation in large-scale scientific applications

Authors:
José Carlos Sancho;Kevin J. Barker;Darren J. Kerbyson;Kei Davis
Affiliations:
Performance and Architecture Laboratory (PAL), Los Alamos National Laboratory, NM;Performance and Architecture Laboratory (PAL), Los Alamos National Laboratory, NM;Performance and Architecture Laboratory (PAL), Los Alamos National Laboratory, NM;Performance and Architecture Laboratory (PAL), Los Alamos National Laboratory, NM
Venue:
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year:
2006


Abstract

The design and implementation of a high-performance communication network are critical factors in determining the performance and cost-effectiveness of a large-scale computing system. The major issues center on the trade-off between network cost and the impact of latency and bandwidth on application performance. One promising technique for extracting maximum application performance from limited network resources is to overlap computation with communication, which partially or entirely hides communication delays. While this approach is not new, few studies quantify the potential benefit of such overlap for large-scale production scientific codes. We address this with an empirical method combined with a network model to quantify the potential overlap in several codes and examine the possible performance benefit. Our results demonstrate, for the codes examined, a high potential tolerance to network latency and bandwidth because of a high degree of potential overlap. Moreover, our results indicate that fine-grained communication mechanisms are often unnecessary to achieve this benefit, since the major source of potential overlap is independent work: computation on which pending messages do not depend. This allows for a potentially significant relaxation of network requirements without a consequent degradation of application performance.
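
As a rough illustration of the idea of hiding communication behind independent work, the following is a minimal sketch and not the authors' empirical method or network model. It assumes a simple latency-plus-bandwidth cost per message and a single compute phase per iteration; the function names (`comm_time`, `iteration_time`) and all parameter values are hypothetical.

```python
def comm_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost of one message: fixed latency plus size over bandwidth."""
    return latency_s + msg_bytes / bandwidth_bytes_per_s


def iteration_time(compute_s, independent_fraction, comm_s, overlap=True):
    """Estimate the time of one iteration under a toy overlap model.

    independent_fraction is the share of the compute phase that is treated
    as independent work, i.e. work that can run while messages are pending.
    """
    if not overlap:
        # No overlap: communication is fully exposed.
        return compute_s + comm_s
    independent = compute_s * independent_fraction
    dependent = compute_s - independent
    # Communication runs alongside independent work; only the part that
    # exceeds the independent work remains exposed.
    return max(independent, comm_s) + dependent


if __name__ == "__main__":
    compute_s = 0.010  # hypothetical 10 ms of computation per iteration
    # Compare a faster and a slower (hypothetical) network.
    for latency_s, bandwidth in [(1e-5, 1e9), (1e-4, 1e8)]:
        c = comm_time(1_000_000, latency_s, bandwidth)
        t_plain = iteration_time(compute_s, 0.5, c, overlap=False)
        t_over = iteration_time(compute_s, 0.5, c, overlap=True)
        print(f"comm={c:.5f}s  no-overlap={t_plain:.5f}s  overlap={t_over:.5f}s")
```

In this toy model, a slower network costs nothing extra as long as the communication time stays below the independent work, which is the sense in which overlap can relax network requirements.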