OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Authors:
Seyong Lee; Rudolf Eigenmann
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010

Citing 14
Cited 38

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
PEAK—a fast and effective performance tuning system via compiler optimization orchestration

ACM Transactions on Programming Languages and Systems (TOPLAS)
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
CUDA-Lite: Reducing GPU Programming Complexity

Languages and Compilers for Parallel Computing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
hiCUDA: a high-level directive-based language for GPU programming

Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
A cross-input adaptive framework for GPU program optimizations

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Auto-tuning 3-D FFT library for CUDA GPUs

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Cetus: A Source-to-Source Compiler Infrastructure for Multicores

Computer
The university of Florida sparse matrix collection

ACM Transactions on Mathematical Software (TOMS)
Automatic C-to-CUDA code generation for affine programs

CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction

An execution strategy and optimized runtime support for parallelizing irregular reductions on modern GPUs

Proceedings of the international conference on Supercomputing
Mint: realizing CUDA performance in 3D stencil methods with annotated C

Proceedings of the international conference on Supercomputing
Automating GPU computing in MATLAB

Proceedings of the international conference on Supercomputing
Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

Proceedings of the 20th international symposium on High performance distributed computing
OpenMP for accelerators

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Performance analysis and tuning of automatically parallelized OpenMP applications

IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
GROPHECY: GPU performance projection from CPU code skeletons

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
A distributed data-parallel framework for analysis and visualization algorithm development

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
GA-GPU: extending a library-based global address space programming model for scalable heterogeneous computing systems

Proceedings of the 9th conference on Computing Frontiers
Parameterized micro-benchmarking: an auto-tuning approach for complex applications

Proceedings of the 9th conference on Computing Frontiers
An extension of XcalableMP PGAS language for multi-node GPU clusters

Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing
A virtual memory based runtime to support multi-tenancy in clusters with GPUs

Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
A systematic process for efficient execution on Intel's heterogeneous computation nodes

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Effects of compiler optimizations in OpenMP to CUDA translation

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Early evaluation of directive-based GPU programming models for productive exascale computing

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Automatic generation of software pipelines for heterogeneous parallel systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
ValuePack: value-based scheduling framework for CPU-GPU clusters

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
OpenACC: first experiences with real-world applications

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
Iterative statistical kernels on contemporary GPUs

International Journal of Computational Science and Engineering
libEOMP: a portable OpenMP runtime library based on MCA APIs for embedded systems

Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
Input-aware auto-tuning for directive-based GPU programming

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Scaling large-data computations on multi-GPU accelerators

Proceedings of the 27th international ACM conference on International conference on supercomputing
Portable mapping of OpenMP to multicore embedded systems using MCA APIs

Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
General transformations for GPU execution of tree traversals

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Scheduling concurrent applications on a cluster of CPU-GPU nodes

Future Generation Computer Systems
Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

Journal of Parallel and Distributed Computing
Towards making autotuning mainstream

International Journal of High Performance Computing Applications
Automatic data allocation and buffer management for multi-GPU machines

ACM Transactions on Architecture and Code Optimization (TACO)
Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Proceedings of Workshop on General Purpose Processing Using GPUs
A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters

The Journal of Supercomputing

Abstract

General-Purpose Graphics Processing Units (GPGPUs) are promising parallel platforms for high-performance computing. The CUDA (Compute Unified Device Architecture) programming model improves programmability for general-purpose computing on GPGPUs. However, its unique execution and memory models still pose significant challenges for developers who need efficient GPGPU code. This paper proposes a new programming interface, called OpenMPC, which builds on OpenMP to provide an abstraction of the complex CUDA programming model and offers high-level control over the parameters and optimizations involved. We have developed a fully automatic compilation and user-assisted tuning system supporting OpenMPC. In addition to a range of compiler transformations and optimizations, the system includes tuning capabilities for generating, pruning, and navigating the search space of compilation variants. Our results demonstrate that OpenMPC offers both programmability and tunability. Our system achieves 88% of the performance of hand-coded CUDA programs.
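
To make the starting point concrete, the sketch below shows the kind of standard OpenMP C loop that a directive-based interface built on OpenMP would take as input. It is only an illustration under stated assumptions: the kernel and the function name `saxpy` are hypothetical and do not come from the paper, and the code uses plain OpenMP only, because the OpenMPC-specific directives are not spelled out in this record.

```c
#include <stddef.h>

/* Hypothetical example kernel (an assumption, not taken from the paper):
 * computes y[i] = a * x[i] + y[i] for every element.
 *
 * The pragma below is standard OpenMP. Every iteration is independent,
 * so the loop is a natural candidate for data-parallel execution.
 * Without OpenMP support the pragma is ignored and the loop runs
 * serially, producing the same values. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(4, 2.0f, x, y)` on two four-element arrays updates `y` in place. Written directly in CUDA, the same loop would also need explicit device allocation, host-to-device copies, and a kernel launch configuration, which is the detail the abstract says the interface is meant to hide.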