Optimizing irregular shared-memory applications for distributed-memory systems

Authors:
Ayon Basumallik; Rudolf Eigenmann
Affiliations:
Purdue University, West Lafayette, IN; Purdue University, West Lafayette, IN
Venue:
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2006

Abstract

In prior work, we proposed techniques that extend the ease of shared-memory parallel programming to distributed-memory platforms by automatically translating OpenMP programs to MPI. For irregular applications, the performance of this translation scheme is limited because accesses to shared data cannot be accurately resolved at compile time. Additionally, irregular applications with high communication-to-computation ratios pose challenges even for direct implementation on message-passing systems. In this paper, we present combined compile-time/run-time techniques for optimizing irregular shared-memory applications on message-passing systems in the context of automatic translation from OpenMP to MPI. Our transformations enable computation-communication overlap by restructuring irregular parallel loops. The compiler creates inspectors that analyze the actual data access patterns of irregular accesses at run time. This analysis is combined with the compile-time analysis of regular data accesses to determine which iterations of irregular loops access non-local data. The iterations are then reordered to enable computation-communication overlap. Where the irregular access occurs inside nested loops, the loop nest is restructured. We evaluate our techniques by translating OpenMP versions of three benchmarks from two important classes of irregular applications: sparse matrix computations and molecular dynamics. On sixteen nodes, the versions employing computation-communication overlap are almost twice as fast as the baseline OpenMP-to-MPI versions and almost 30% faster than the inspector-only versions. Compared with the hand-coded versions, they are almost 25% faster on two applications and about 9% slower on the third.
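The inspector and iteration-reordering idea described in the abstract can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' compiler output: the block distribution, the loop body `y[i] = a[i] * x[idx[i]]`, and the names `block_owner`, `inspect` and `execute` are all hypothetical, and message passing is simulated by dictionary lookups rather than real MPI calls. It also does not model the compile-time analysis of regular accesses that the paper combines with the inspector.

```python
from typing import Dict, List, Sequence, Tuple


def block_owner(index: int, n: int, nprocs: int) -> int:
    """Rank owning element `index` under an assumed block distribution."""
    block = (n + nprocs - 1) // nprocs  # ceiling division
    return index // block


def inspect(iters: Sequence[int], idx: Sequence[int], n: int,
            nprocs: int, rank: int) -> Tuple[List[int], List[int], List[int]]:
    """Inspector: classify iterations by whether x[idx[i]] is owned by `rank`.

    Returns (local_iters, remote_iters, needed), where `needed` holds the
    sorted non-local indices of x that this rank would have to receive.
    """
    local_iters: List[int] = []
    remote_iters: List[int] = []
    needed = set()
    for i in iters:
        if block_owner(idx[i], n, nprocs) == rank:
            local_iters.append(i)
        else:
            remote_iters.append(i)
            needed.add(idx[i])
    return local_iters, remote_iters, sorted(needed)


def execute(x_global: Sequence[float], idx: Sequence[int], a: Sequence[float],
            iters: Sequence[int], nprocs: int, rank: int) -> Dict[int, float]:
    """Executor: run the reordered loop, local iterations first."""
    n = len(x_global)
    # Elements of x that this rank owns; x_global stands in for all ranks' memory.
    x_local = {j: x_global[j] for j in range(n)
               if block_owner(j, n, nprocs) == rank}
    local_iters, remote_iters, needed = inspect(iters, idx, n, nprocs, rank)
    y: Dict[int, float] = {}
    # Step 1: a real translation would post non-blocking receives for `needed`.
    # Step 2: local iterations run while those messages would be in flight.
    for i in local_iters:
        y[i] = a[i] * x_local[idx[i]]
    # Step 3: a real translation would wait here; the fetch is simulated.
    ghost = {j: x_global[j] for j in needed}
    # Step 4: the remaining iterations consume the received values.
    for i in remote_iters:
        y[i] = a[i] * ghost[idx[i]]
    return y
```

Because the local iterations read only from `x_local`, a misclassified iteration would raise a `KeyError`, which makes the local/non-local split easy to check. The reordering is valid in this sketch only because the iterations are independent, as they would be in an OpenMP parallel loop without cross-iteration dependences.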