OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Authors:
Seyong Lee;Seung-Jai Min;Rudolf Eigenmann
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
2009

Citing 12
Cited 96

Automatic translation of FORTRAN programs to vector form

ACM Transactions on Programming Languages and Systems (TOPLAS)
Optimizing OpenMP programs on software distributed shared memory systems

International Journal of Parallel Programming - Special issue: OpenMP: Experiences and implementations
An integrated simdization framework using virtual vectors

Proceedings of the 19th annual international conference on Supercomputing
Towards automatic translation of OpenMP to MPI

Proceedings of the 19th annual international conference on Supercomputing
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Optimizing irregular shared-memory applications for clusters

Proceedings of the 22nd annual international conference on Supercomputing
Supporting OpenMP on cell

International Journal of Parallel Programming
A framework for efficient and scalable execution of domain-specific templates on GPUs

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
The university of Florida sparse matrix collection

ACM Transactions on Mathematical Software (TOMS)

A translation system for enabling data mining applications on GPUs

Proceedings of the 23rd international conference on Supercomputing
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA

Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming

APLAS '09 Proceedings of the 7th Asian Symposium on Programming Languages and Systems
Compiling Python to a hybrid execution environment

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Implementing the PGI Accelerator model

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Accelerating SQL database operations on a GPU with CUDA

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
A GPGPU compiler for memory optimization and parallelism management

PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
JCudaMP: OpenMP/Java on CUDA

Proceedings of the 3rd International Workshop on Multicore Software Engineering
Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping

Proceedings of the 24th ACM International Conference on Supercomputing
Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations

Proceedings of the 24th ACM International Conference on Supercomputing
Run-time optimizations for replicated dataflows on heterogeneous environments

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
MapCG: writing parallel program portable between CPU and GPU

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Gossamer: a lightweight programming framework for multicore machines

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Extending abstract GPU APIs to shared memory

Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
memCUDA: map device memory to host memory on GPGPU platform

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Compiler-directed memory management for heterogeneous MPSoCs

Journal of Systems Architecture: the EUROMICRO Journal
Breaking the GPU programming barrier with the auto-parallelising SAC compiler

Proceedings of the sixth workshop on Declarative aspects of multicore programming
Copperhead: compiling an embedded data parallel language

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
GRace: a low-overhead mechanism for detecting data races in GPU programs

Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Cost-aware function migration in heterogeneous systems

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
On-the-fly elimination of dynamic irregularities for GPU computing

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines

Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A programming language interface to describe transformations and code generation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Unified parallel C for GPU clusters: language extensions and compiler implementation

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
CnC-CUDA: declarative programming for GPUs

LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
A data parallel view on polyhedral process networks

Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Proceedings of the international conference on Supercomputing
Processing data streams with hard real-time constraints on heterogeneous systems

Proceedings of the international conference on Supercomputing
Mint: realizing CUDA performance in 3D stencil methods with annotated C

Proceedings of the international conference on Supercomputing
Automating GPU computing in MATLAB

Proceedings of the international conference on Supercomputing
Enabling multiple accelerator acceleration for Java/OpenMP

HotPar'11 Proceedings of the 3rd USENIX conference on Hot topic in parallelism
A platform-independent tool for modeling parallel programs

Proceedings of the 49th Annual Southeast Regional Conference
Scaling scientific applications on clusters of hybrid multicore/GPU nodes

Proceedings of the 8th ACM International Conference on Computing Frontiers
Mapping data mining algorithms on a GPU architecture: a study

ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
Streaming-oriented parallelization of domain-independent irregular kernels

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Model-driven tile size selection for DOACROSS loops on GPUs

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
CUDACL+: a framework for GPU programs

Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
Introducing scalable quantum approaches in language representation

QI'11 Proceedings of the 5th international conference on Quantum interaction
Chestnut: a GPU programming language for non-experts

Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
GPU-based NFA implementation for memory efficient high speed regular expression matching

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Speculative parallelization on GPGPUs

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Optimization strategies in different CUDA architectures using llCoMP

Microprocessors & Microsystems
FLAT: a GPU programming framework to provide embedded MPI

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
OMPCUDA: OpenMP execution framework for CUDA based on omni OpenMP compiler

IWOMP'10 Proceedings of the 6th international conference on Beyond Loop Level Parallelism in OpenMP: accelerators, Tasking and more
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
A unified optimizing compiler framework for different GPGPU architectures

ACM Transactions on Architecture and Code Optimization (TACO)
Parallelizing SOR for GPGPUs using alternate loop tiling

Parallel Computing
A data-parallel extension to Ruby for GPGPU: toward a framework for implementing domain-specific optimizations

Proceedings of the 9th ECOOP Workshop on Reflection, AOP, and Meta-Data for Software Evolution
Automatic source code transformation for GPUs based on program comprehension

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Compiling a high-level language for GPUs: (via language support for architectures and compilers)

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Adaptive input-aware compilation for graphics engines

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamically managed data for CPU-GPU architectures

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Automatic restructuring of GPU kernels for exploiting inter-thread data locality

CC'12 Proceedings of the 21st international conference on Compiler Construction
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation

Proceedings of the 26th ACM international conference on Supercomputing
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Proceedings of the 26th ACM international conference on Supercomputing
Optimizing dataflow applications on heterogeneous environments

Cluster Computing
Financial software on GPUs: between Haskell and Fortran

Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Towards efficient GPU sharing on multicore processors

ACM SIGMETRICS Performance Evaluation Review
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Designing a unified programming model for heterogeneous machines

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Accelerating text mining workloads in a MapReduce-based distributed GPU environment

Journal of Parallel and Distributed Computing
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
Iterative statistical kernels on contemporary GPUs

International Journal of Computational Science and Engineering
Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Portable performance on heterogeneous architectures

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
KFusion: optimizing data flow without compromising modularity

Proceedings of the 12th annual international conference on Aspect-oriented software development
Cache-Conscious Wavefront Scheduling

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing divergence in GPGPU programs with loop merging

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
From physics model to results: an optimizing framework for cross-architecture code generation

Proceedings of the Extreme Scaling Workshop
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Portable mapping of openMP to multicore embedded systems using MCA APIs

Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Integrating acceleration devices using CometCloud

Proceedings of the first ACM workshop on Optimization techniques for resources management in clouds
Experimenting with low-overhead OpenMP runtime on IBM Blue Gene/Q

IBM Journal of Research and Development
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Exploring hybrid memory for GPU energy efficiency through software-hardware co-design

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Neither more nor less: optimizing thread-level parallelism for GPGPUs

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Divergence analysis

ACM Transactions on Programming Languages and Systems (TOPLAS)
Trellis: Portability across architectures with a high-level framework

Journal of Parallel and Distributed Computing
Automatic data allocation and buffer management for multi-GPU machines

ACM Transactions on Architecture and Code Optimization (TACO)
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications

Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
The Cetus Source-to-Source Compiler Infrastructure: Overview and Evaluation

International Journal of Parallel Programming
The Implementation of a High Performance GPGPU Compiler

International Journal of Parallel Programming
An application-centric evaluation of OpenCL on multi-core CPUs

Parallel Computing
A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters

The Journal of Supercomputing
From physics model to results: An optimizing framework for cross-architecture code generation

Scientific Programming

Quantified Score

Hi-index	0.00

Visualization

Abstract

GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial).