An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Authors:
Xin Huo; Vignesh Ravi; Wenjing Ma; Gagan Agrawal
Affiliations:
The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA;The Ohio State University, Columbus, OH, USA
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

GPUs have rapidly emerged as a significant player in high performance computing. However, despite the popularity of CUDA, porting different classes of HPC applications to modern GPUs remains challenging. This paper focuses on the challenges of implementing irregular applications arising from unstructured grids on modern NVIDIA GPUs. Given the importance of irregular reductions in scientific and engineering codes, substantial effort went into compiler and runtime support for parallelizing or optimizing these codes over the previous two decades, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance on uniprocessor machines. However, there have not been any systematic studies on parallelizing these applications on modern GPUs. Porting this class of applications to modern GPUs poses at least two significant challenges. The first is correct and efficient parallelization when using a large number of threads. The second is effective use of shared memory: since data accesses cannot be determined statically, runtime partitioning methods are needed to use the shared memory effectively. This paper describes an execution methodology that addresses both challenges. We have also developed optimized runtime modules to support this methodology. Our approach and runtime methods have been extensively evaluated using two applications based on indirection arrays.
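To make the two challenges concrete, the following is a minimal sequential sketch in Python, not the paper's implementation. It assumes an edge-based irregular reduction in which each edge updates two node entries through an indirection array, and it illustrates one possible runtime partitioning scheme: the reduction array is split into fixed-size tiles (standing in for shared memory), and an edge that crosses tiles is replicated into each tile that owns one of its endpoints. All names (`irregular_reduction`, `partition_edges`, `partitioned_reduction`, `partition_size`) are hypothetical and chosen for illustration only.

```python
def irregular_reduction(num_nodes, edges, weights):
    """Reference loop: each edge updates two indirectly addressed elements."""
    result = [0.0] * num_nodes
    for (u, v), w in zip(edges, weights):
        result[u] += w  # write location is known only at run time
        result[v] -= w
    return result


def partition_edges(num_nodes, edges, partition_size):
    """Runtime step: group edge ids by the tile that owns each endpoint."""
    num_parts = (num_nodes + partition_size - 1) // partition_size
    buckets = [[] for _ in range(num_parts)]
    for eid, (u, v) in enumerate(edges):
        pu, pv = u // partition_size, v // partition_size
        buckets[pu].append(eid)
        if pv != pu:
            buckets[pv].append(eid)  # cross-tile edge is replicated
    return buckets


def partitioned_reduction(num_nodes, edges, weights, partition_size):
    """Process one tile at a time; each tile writes only elements it owns."""
    result = [0.0] * num_nodes
    buckets = partition_edges(num_nodes, edges, partition_size)
    for part, edge_ids in enumerate(buckets):
        base = part * partition_size
        size = min(partition_size, num_nodes - base)
        local = [0.0] * size  # stands in for a shared-memory tile
        for eid in edge_ids:
            u, v = edges[eid]
            w = weights[eid]
            if base <= u < base + size:
                local[u - base] += w
            if base <= v < base + size:
                local[v - base] -= w
        result[base:base + size] = local  # one write-back per tile
    return result


if __name__ == "__main__":
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    weights = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(irregular_reduction(4, edges, weights))
    print(partitioned_reduction(4, edges, weights, partition_size=2))
```

In this sketch, tiles never write to each other's elements, so on a GPU each tile could in principle be assigned to a separate thread block without conflicts between blocks. The cost is the redundant work on replicated edges, and updates within a single tile would still need to be made safe across threads.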