Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, special issue: algorithms for hypercube computers.
The High Performance Fortran Handbook.
Context-sensitive interprocedural points-to analysis in the presence of function pointers. PLDI '94: Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation.
Efficient support for irregular applications on distributed-memory machines. PPOPP '95: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
Interprocedural compilation of irregular applications for distributed memory machines. Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing.
Index array flattening through program transformation. Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing.
A study of the EARTH-MANNA multithreaded system. International Journal of Parallel Programming, special issue on parallel architectures and compilation techniques, part II.
Interprocedural data flow based optimizations for distributed memory compilation. Software—Practice & Experience.
Compiling C for the EARTH multithreaded architecture. International Journal of Parallel Programming, special issue: selected papers from PACT '96, Fourth International Conference on Parallel Architectures and Compilation Techniques, part I.
Communication optimizations for parallel C programs. PLDI '98: Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation.
Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation.
Improving memory hierarchy performance for irregular applications. ICS '99: Proceedings of the 13th International Conference on Supercomputing.
Proceedings of the 14th International Conference on Supercomputing.
Automatic compiler techniques for thread coarsening for multithreaded architectures. Proceedings of the 14th International Conference on Supercomputing.
High Performance Compilers for Parallel Computing.
Parallelizing Molecular Dynamics Programs for Distributed-Memory Machines. IEEE Computational Science & Engineering.
Distributed Memory Compiler Design for Sparse Problems. IEEE Transactions on Computers.
Compiling Global Name-Space Parallel Loops for Distributed Execution. IEEE Transactions on Parallel and Distributed Systems.
Exploiting spatial regularity in irregular iterative applications. IPPS '95: Proceedings of the 9th International Symposium on Parallel Processing.
Euro-Par '00: Proceedings from the 6th International Euro-Par Conference on Parallel Processing.
On the Automatic Parallelization of Sparse and Irregular Fortran Programs. LCR '98: Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers.
A Comparison of Locality Transformations for Irregular Codes. LCR '00: Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers.
Heap Analysis and Optimizations for Threaded Programs. PACT '97: Proceedings of the 1997 International Conference on Parallel Architectures and Compilation Techniques.
Compiling Several Classes of Communication Patterns on a Multithreaded Architecture. IPDPS '02: Proceedings of the 16th International Parallel and Distributed Processing Symposium.
An Adaptive Algorithm Selection Framework for Reduction Parallelization. IEEE Transactions on Parallel and Distributed Systems.
An analytical model of locality-based parallel irregular reductions. Parallel Computing.
Computations from many scientific and engineering domains use irregular meshes and/or sparse matrices. The codes expressing these computations involve irregular reductions, which pose many challenges to parallel architectures and their compilers in terms of parallelization, locality management, and communication optimization.

Multithreaded architectures offer rich support for local synchronization, overlapping of communication and computation, and low-overhead communication and thread switching. They therefore appear promising for scalable parallelization of irregular reductions. This paper presents an execution model and a compilation strategy for supporting irregular reductions on a fine-grained multithreaded architecture. The key aspect of this strategy is that the frequency and volume of communication are independent of the contents of the indirection arrays. The performance obtained depends on the architecture's ability to overlap communication and computation, and is largely independent of the partitioning of the problem.

We present experimental results from compiling three scientific kernels involving irregular reductions (mvm, euler, and moldyn) for execution on the EARTH fine-grained multithreaded architecture. On mvm, which does not involve any left-hand-side irregular accesses, we achieve near-linear absolute speedups. For euler and moldyn, which do involve left-hand-side irregular accesses, our strategy initially incurs some overheads, but the relative speedups are very good. In going from 2 to 32 processors, the relative speedups for euler were 9.28 and 10.36 on its two datasets, while the speedups for moldyn were 9.70 and 10.76 on its two datasets.