Efficient support for irregular applications on distributed-memory machines
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Index array flattening through program transformation
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Interprocedural data flow based optimizations for distributed memory compilation
Software—Practice & Experience
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Improving memory hierarchy performance for irregular applications
ICS '99 Proceedings of the 13th international conference on Supercomputing
Adaptive reduction parallelization techniques
Proceedings of the 14th international conference on Supercomputing
Proceedings of the 14th international conference on Supercomputing
Partitioning Unstructured Computational Graphs for Nonuniform and Adaptive Environments
IEEE Parallel & Distributed Technology: Systems & Technology
Parallelizing Molecular Dynamics Programs for Distributed-Memory Machines
IEEE Computational Science & Engineering
Parallel Programming with Polaris
Computer
Distributed Memory Compiler Design For Sparse Problems
IEEE Transactions on Computers
Compiling Global Name-Space Parallel Loops for Distributed Execution
IEEE Transactions on Parallel and Distributed Systems
Runtime Support and Compilation Methods for User-Specified Irregular Data Distributions
IEEE Transactions on Parallel and Distributed Systems
Exploiting spatial regularity in irregular iterative applications
IPPS '95 Proceedings of the 9th International Symposium on Parallel Processing
Improving Compiler and Run-Time Support for Irregular Reductions Using Local Writes
LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
On the Automatic Parallelization of Sparse and Irregular Fortran Programs
LCR '98 Selected Papers from the 4th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
A Comparison of Locality Transformations for Irregular Codes
LCR '00 Selected Papers from the 5th International Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers
Localizing Non-Affine Array References
PACT '99 Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Sparse matrix computations on manycore GPU's
Proceedings of the 45th annual Design Automation Conference
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A translation system for enabling data mining applications on GPUs
Proceedings of the 23rd international conference on Supercomputing
A framework for efficient and scalable execution of domain-specific templates on GPUs
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Model-driven autotuning of sparse matrix-vector multiply on GPUs
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Molecular dynamics simulations on commodity GPUs with CUDA
HiPC'07 Proceedings of the 14th international conference on High performance computing
A GPGPU compiler for memory optimization and parallelism management
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
OpenMPC: Extended OpenMP Programming and Tuning for GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Hi-index | 0.00 |
GPUs have rapidly emerged as a very significant player in high performance computing. However, despite the popularity of CUDA, there are significant challenges in porting different classes of HPC applications on modern GPUs. This paper focuses on the challenges of implementing irregular applications arising from unstructured grids on modern NVIDIA GPUs. Considering the importance of irregular reductions in scientific and engineering codes, substantial effort was made in developing compiler and runtime support for parallelization or optimization of these codes in the previous two decades, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance improvement on uniprocessor machines. However, there have not been any systematic studies on parallelizing these applications on modern GPUs. There are at least two significant challenges associated with porting this class of applications on modern GPUs. The first is related to correct and efficient parallelization while using a large number of threads. The second challenge is effective use of shared memory. Since data accesses cannot be determined statically, runtime partitioning methods are needed for effectively using the shared memory. This paper describes an execution methodology that can address the above two challenges. We have also developed optimized runtime modules to support our execution methodology. Our approach and runtime methods have been extensively evaluated using two indirection array based applications.