Bamboo: translating MPI applications to a latency-tolerant, data-driven form

Authors:
Tan Nguyen; Pietro Cicotti; Eric Bylaska; Dan Quinlan; Scott B. Baden
Affiliations:
University of California, San Diego, La Jolla, CA; University of California, San Diego, La Jolla, CA; Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA; Center for Advanced Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA; University of California, San Diego, La Jolla, CA
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98,304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand-optimized MPI, which includes split-phase coding, the method classically employed to hide communication. We achieved our results with only modest amounts of programmer annotation and no intrusive reprogramming of the original application source.
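
As a rough illustration of the split-phase coding the abstract refers to, the sketch below restructures a single 1D Jacobi sweep so that the interior points are updated while the halo values are still "in flight". It is not Bamboo's generated code and does not use MPI: a background thread stands in for a nonblocking receive (the role `MPI_Irecv` followed by `MPI_Wait` would play in an MPI C program), and the function names `fetch_halo`, `jacobi_blocking`, and `jacobi_split_phase` are hypothetical, chosen only for this example. It assumes a local array of at least two points.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_halo(left_edge, right_edge):
    # Hypothetical stand-in for a nonblocking receive of the two
    # neighbor values; a real code would post MPI receives instead.
    return left_edge, right_edge


def jacobi_blocking(u, left, right):
    # Reference version: obtain the halo first, then update every point.
    padded = [left] + u + [right]
    return [(padded[i - 1] + padded[i + 1]) / 2.0
            for i in range(1, len(padded) - 1)]


def jacobi_split_phase(u, left, right):
    # Split-phase version (assumes len(u) >= 2).
    n = len(u)
    new = [0.0] * n
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Phase 1: initiate the communication and return immediately.
        pending = pool.submit(fetch_halo, left, right)
        # Overlap: interior points do not depend on the halo.
        for i in range(1, n - 1):
            new[i] = (u[i - 1] + u[i + 1]) / 2.0
        # Phase 2: wait for the communication to complete.
        halo_left, halo_right = pending.result()
    # Boundary points need the halo, so they are updated last.
    new[0] = (halo_left + u[1]) / 2.0
    new[n - 1] = (u[n - 2] + halo_right) / 2.0
    return new
```

Both functions compute the same sweep; the split-phase version only reorders the work so that the wait happens after the halo-independent computation. This manual restructuring is the kind of rewrite the abstract says Bamboo aims to make unnecessary.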