We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98,304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication algorithm. Bamboo's generated code meets or exceeds the performance of hand-optimized MPI code, including code that uses split-phase coding, the technique classically employed to hide communication latency. We achieve these results with only modest amounts of programmer annotation and no intrusive reprogramming of the original application source.
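To make the split-phase idea concrete, the following is a minimal single-process sketch in C, not Bamboo's generated code and not the authors' solver. It assumes a 1D Jacobi sweep rather than the 3D solver, and it replaces the MPI calls with hypothetical stand-ins (`start_exchange`, `finish_exchange`) so that it runs without an MPI installation. In a real MPI program the first stand-in would correspond to posting `MPI_Irecv`/`MPI_Isend` and the second to `MPI_Waitall`. All function and type names here are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an in-flight ghost-cell exchange.
   In MPI this state would live in MPI_Request handles. */
typedef struct {
    double left, right; /* ghost values still "in flight" */
} Exchange;

/* Phase 1 (stand-in for posting MPI_Irecv/MPI_Isend):
   start the exchange without delivering anything yet. */
static Exchange start_exchange(double left_neighbor, double right_neighbor) {
    Exchange ex = { left_neighbor, right_neighbor };
    return ex;
}

/* Phase 2 (stand-in for MPI_Waitall):
   the ghost values land in u[0] and u[n-1]. */
static void finish_exchange(const Exchange *ex, double *u, size_t n) {
    u[0] = ex->left;
    u[n - 1] = ex->right;
}

/* One 1D Jacobi sweep in split-phase form.
   u and out hold n entries; u[0] and u[n-1] are ghost cells. */
void jacobi_split_phase(double *u, double *out, size_t n,
                        double left_neighbor, double right_neighbor) {
    assert(n >= 3);
    Exchange ex = start_exchange(left_neighbor, right_neighbor);

    /* Overlap window: interior points read only u[1..n-2],
       so they do not depend on the pending ghost values. */
    for (size_t i = 2; i + 2 < n; ++i)
        out[i] = 0.5 * (u[i - 1] + u[i + 1]);

    finish_exchange(&ex, u, n);

    /* Boundary points need the ghost cells, so they run last. */
    out[1] = 0.5 * (u[0] + u[2]);
    out[n - 2] = 0.5 * (u[n - 3] + u[n - 1]);
}

/* Reference version: complete the exchange first, then sweep. */
void jacobi_blocking(double *u, double *out, size_t n,
                     double left_neighbor, double right_neighbor) {
    assert(n >= 3);
    u[0] = left_neighbor;
    u[n - 1] = right_neighbor;
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = 0.5 * (u[i - 1] + u[i + 1]);
}
```

Both functions produce the same updated values. The split-phase version only reorders the work so that the interior update can proceed while the ghost values are still in transit. This manual restructuring is the kind of hand optimization the abstract contrasts with Bamboo's automatic overlap.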