Graphics Processing Units (GPUs) offer tremendous computational power. CUDA (Compute Unified Device Architecture) provides a multi-threaded parallel programming model that facilitates high-performance implementations of general-purpose computations. However, the explicitly managed memory hierarchy and the multi-level view of parallelism make manual development of high-performance CUDA code complicated, so the automatic transformation of sequential input programs into efficient parallel CUDA programs is of considerable interest. This paper describes an automatic code transformation system that generates parallel CUDA code from sequential C input code for regular (affine) programs. Using and adapting publicly available tools that have made polyhedral compiler optimization practically effective, we develop a C-to-CUDA transformation system that generates two-level parallel CUDA code optimized for efficient data access. The automatically generated code is compared with manually optimized CUDA code on a number of benchmarks: its performance is quite close to that of the hand-optimized code and considerably better than the benchmarks' performance on a multicore CPU.
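To make "two-level parallel code optimized for efficient data access" concrete, the following is a minimal sketch in plain C, not output of the system described here. It takes a small affine loop nest (a matrix-vector product) and restructures it the way a block/thread mapping might: an outer loop standing in for CUDA thread blocks, an inner loop standing in for the threads of a block, and a small local buffer standing in for explicitly managed on-chip (shared) memory. The function names, the `BLOCK` tile size, and the choice of kernel are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4 /* hypothetical tile size: "threads per block" */

/* Sequential affine loop nest: y = A * x, with A an n x n row-major matrix. */
void matvec_seq(size_t n, const float *A, const float *x, float *y) {
    for (size_t i = 0; i < n; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* The same computation restructured in two levels. Loop b plays the role
 * of blockIdx.x and loop t the role of threadIdx.x; on a GPU both loops
 * would run in parallel. x_tile stands in for a __shared__ buffer. */
void matvec_two_level(size_t n, const float *A, const float *x, float *y) {
    assert(A != NULL && x != NULL && y != NULL);
    size_t num_blocks = (n + BLOCK - 1) / BLOCK;

    for (size_t b = 0; b < num_blocks; b++) {       /* outer level: blocks */
        float acc[BLOCK] = {0.0f};                  /* one accumulator per "thread" */

        for (size_t jt = 0; jt < n; jt += BLOCK) {  /* walk x one tile at a time */
            float x_tile[BLOCK];
            size_t len = (n - jt < BLOCK) ? (n - jt) : BLOCK;

            /* Stage the tile once so every "thread" in the block reuses it. */
            for (size_t t = 0; t < len; t++)
                x_tile[t] = x[jt + t];
            /* A real kernel would need a barrier (__syncthreads()) here. */

            for (size_t t = 0; t < BLOCK; t++) {    /* inner level: threads */
                size_t i = b * BLOCK + t;
                if (i >= n)
                    continue;                       /* guard the partial last block */
                for (size_t k = 0; k < len; k++)
                    acc[t] += A[i * n + jt + k] * x_tile[k];
            }
        }

        for (size_t t = 0; t < BLOCK; t++) {        /* write results back */
            size_t i = b * BLOCK + t;
            if (i < n)
                y[i] = acc[t];
        }
    }
}
```

Both functions visit the elements of each row in the same order, so they compute the same result. The restructured version only changes which iterations are grouped together and where the reused slice of `x` is held, which is the kind of decision a polyhedral transformation system makes automatically.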