Automatic C-to-CUDA code generation for affine programs

Authors:
Muthu Manikandan Baskaran;J. Ramanujam;P. Sadayappan
Affiliations:
The Ohio State University;Louisiana State University;The Ohio State University
Venue:
CC'10/ETAPS'10 Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction
Year:
2010

Citing 20
Cited 43

Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Scanning polyhedra with DO loops

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
A practical algorithm for exact array dependence analysis

Communications of the ACM
Some efficient solutions to the affine scheduling problem: I. One-dimensional time

International Journal of Parallel Programming
Generation of Efficient Nested Loops from Polyhedra

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, part 2
Automatic Parallelization in the Polytope Model

The Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications
Improving parallelism and data locality with affine partitioning

Improving parallelism and data locality with affine partitioning
Code Generation in the Polyhedral Model Is Easier Than You Think

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Understanding the efficiency of GPU algorithms for matrix-matrix multiplication

Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware
Violated dependence analysis

Proceedings of the 20th annual international conference on Supercomputing
A memory model for scientific algorithms on graphics processors

Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time

Proceedings of the International Symposium on Code Generation and Optimization
Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus

Proceedings of the 22nd annual international conference on Supercomputing
A practical automatic polyhedral parallelizer and locality optimizer

Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation
OpenMP to GPGPU: a compiler framework for automatic translation and optimization

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A cross-input adaptive framework for GPU program optimizations

IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Automatic transformations for communication-minimized parallelization and locality optimization in the polyhedral model

CC'08/ETAPS'08 Proceedings of the Joint European Conferences on Theory and Practice of Software 17th international conference on Compiler construction

A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction

Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
OpenMPC: Extended OpenMP Programming and Tuning for GPUs

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Reducing branch divergence in GPU programs

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A data parallel view on polyhedral process networks

Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems
Automatic CPU-GPU communication management and optimization

Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Automating GPU computing in MATLAB

Proceedings of the international conference on Supercomputing
CuMAPz: a tool to analyze memory access patterns in CUDA

Proceedings of the 48th Design Automation Conference
Streaming-oriented parallelization of domain-independent irregular kernels

Euro-Par 2010 Proceedings of the 2010 conference on Parallel processing
Model-driven tile size selection for DOACROSS loops on GPUs

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
GROPHECY: GPU performance projection from CPU code skeletons

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks

ACM SIGARCH Computer Architecture News
Introducing 'Bones': a parallelizing source-to-source compiler based on algorithmic skeletons

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Paragon: collaborative speculative loop execution on GPU and CPU

Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units
Extendable pattern-oriented optimization directives

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Parallelizing SOR for GPGPUs using alternate loop tiling

Parallel Computing
Adaptive input-aware compilation for graphics engines

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamically managed data for CPU-GPU architectures

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Automatic restructuring of GPU kernels for exploiting inter-thread data locality

CC'12 Proceedings of the 21st international conference on Compiler Construction
Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors

Proceedings of the 26th ACM international conference on Supercomputing
A systematic process for efficient execution on Intel's heterogeneous computation nodes

Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond
Extendable pattern-oriented optimization directives

ACM Transactions on Architecture and Code Optimization (TACO)
Financial software on GPUs: between Haskell and Fortran

Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
Code generation for parallel execution of a class of irregular loops on distributed memory systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dataflow-driven GPU performance projection for multi-kernel transformations

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
A script-based autotuning compiler system to generate high-performance CUDA code

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Polyhedral parallel code generation for CUDA

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
OpenMPC: extended OpenMP for efficient programming and tuning on GPUs

International Journal of Computational Science and Engineering
Iterative statistical kernels on contemporary GPUs

International Journal of Computational Science and Engineering
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Portable performance on heterogeneous architectures

Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
KFusion: optimizing data flow without compromising modularity

Proceedings of the 12th annual international conference on Aspect-oriented software development
Split tiling for GPUs: automatic parallelization using trapezoidal tiles

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Input-aware auto-tuning for directive-based GPU programming

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Atomic-free irregular computations on GPUs

Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Orchestrated scheduling and prefetching for GPGPUs

Proceedings of the 40th Annual International Symposium on Computer Architecture
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Memory performance estimation of CUDA programs

ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
Automatic data allocation and buffer management for multi-GPU machines

ACM Transactions on Architecture and Code Optimization (TACO)
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
Efficient Mapping of Irregular C++ Applications to Integrated GPUs

Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
MobileFBP: Designing portable reconfigurable applications for heterogeneous systems

Journal of Systems Architecture: the EUROMICRO Journal
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation

Proceedings of Workshop on General Purpose Processing Using GPUs
Leveraging GPUs using cooperative loop speculation

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model, facilitating high performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and multi-level parallel view make manual development of high-performance CUDA code rather complicated. Hence the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from input sequential C code, for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code that is optimized for efficient data access. The performance of automatically generated code is compared with manually optimized CUDA code for a number of benchmarks. The performance of the automatically generated CUDA code is quite close to hand-optimized CUDA code and considerably better than the benchmarks' performance on a multicore CPU.