Iterative Optimization in the Polyhedral Model: Part I, One-Dimensional Time
Proceedings of the International Symposium on Code Generation and Optimization
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Program optimization space pruning for a multithreaded gpu
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
A compiler framework for optimization of affine loop nests for gpgpus
Proceedings of the 22nd annual international conference on Supercomputing
High performance discrete Fourier transforms on graphics processors
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
OpenMP to GPGPU: a compiler framework for automatic translation and optimization
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
Compilers: Principles, Techniques, & Tools with Gradiance
Compilers: Principles, Techniques, & Tools with Gradiance
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
Proceedings of the 36th annual international symposium on Computer architecture
A cross-input adaptive framework for GPU program optimizations
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
An adaptive performance modeling tool for GPU architectures
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Achieving a single compute device image in OpenCL for multiple GPUs
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
On-the-fly elimination of dynamic irregularities for GPU computing
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Reducing branch divergence in GPU programs
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units
A programming language interface to describe transformations and code generation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Unified parallel C for GPU clusters: language extensions and compiler implementation
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the international conference on Supercomputing
Elastic pipeline: addressing GPU on-chip shared memory bank conflicts
Proceedings of the 8th ACM International Conference on Computing Frontiers
CuMAPz: a tool to analyze memory access patterns in CUDA
Proceedings of the 48th Design Automation Conference
Model-driven tile size selection for DOACROSS loops on GPUs
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Automatic OpenCL device characterization: guiding optimized kernel design
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
ACM SIGARCH Computer Architecture News
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Speculative parallelization on GPGPUs
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
A unified optimizing compiler framework for different GPGPU architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Parameterized micro-benchmarking: an auto-tuning approach for complex applications
Proceedings of the 9th conference on Computing Frontiers
Parallelizing SOR for GPGPUs using alternate loop tiling
Parallel Computing
Compiling a high-level language for GPUs: (via language support for architectures and compilers)
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Adaptive input-aware compilation for graphics engines
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Characterizing and improving the use of demand-fetched caches in GPUs
Proceedings of the 26th ACM international conference on Supercomputing
One stone two birds: synchronization relaxation and redundancy removal in GPU-CPU translation
Proceedings of the 26th ACM international conference on Supercomputing
Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Proceedings of the 39th Annual International Symposium on Computer Architecture
A compiler-assisted runtime-prefetching scheme for heterogeneous platforms
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Financial software on GPUs: between Haskell and Fortran
Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing
Shared memory multiplexing: a novel way to improve GPGPU throughput
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
A script-based autotuning compiler system to generate high-performance CUDA code
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Integration of Dataflow-Based Heterogeneous Multiprocessor Scheduling Techniques in GNU Radio
Journal of Signal Processing Systems
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
KFusion: optimizing data flow without compromising modularity
Proceedings of the 12th annual international conference on Aspect-oriented software development
Memory reuse optimizations in the R-Stream compiler
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
Cost-effective soft-error protection for SRAM-based structures in GPGPUs
Proceedings of the ACM International Conference on Computing Frontiers
Semi-automatic restructuring of offloadable tasks for many-core accelerators
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Memory performance estimation of CUDA programs
ACM Transactions on Embedded Computing Systems (TECS) - Special issue on application-specific processors
ACM Transactions on Programming Languages and Systems (TOPLAS)
User transparent data and task parallel multimedia computing with Pyxis-DT
Future Generation Computer Systems
Portable and Transparent Host-Device Communication Optimization for GPGPU Environments
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization
CUDA-NP: realizing nested thread-level parallelism in GPGPU applications
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
The Cetus Source-to-Source Compiler Infrastructure: Overview and Evaluation
International Journal of Parallel Programming
The Implementation of a High Performance GPGPU Compiler
International Journal of Parallel Programming
An efficient compiler framework for cache bypassing on GPUs
Proceedings of the International Conference on Computer-Aided Design
Optimizing convolution operations on GPUs using adaptive tiling
Future Generation Computer Systems
Hi-index | 0.00 |
This paper presents a novel optimizing compiler for general purpose computation on graphics processing units (GPGPU). It addresses two major challenges of developing high performance GPGPU programs: effective utilization of GPU memory hierarchy and judicious management of parallelism. The input to our compiler is a naïve GPU kernel function, which is functionally correct but without any consideration for performance optimization. The compiler analyzes the code, identifies its memory access patterns, and generates both the optimized kernel and the kernel invocation parameters. Our optimization process includes vectorization and memory coalescing for memory bandwidth enhancement, tiling and unrolling for data reuse and parallelism management, and thread block remapping or address-offset insertion for partition-camping elimination. The experiments on a set of scientific and media processing algorithms show that our optimized code achieves very high performance, either superior or very close to the highly fine-tuned library, NVIDIA CUBLAS 2.2, and up to 128 times speedups over the naive versions. Another distinguishing feature of our compiler is the understandability of the optimized code, which is useful for performance analysis and algorithm refinement.