ICSE '81 Proceedings of the 5th international conference on Software engineering
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A cross-input adaptive framework for GPU program optimizations
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing
Modeling GPU-CPU workloads and systems
Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
On-the-fly elimination of dynamic irregularities for GPU computing
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Sponge: portable stream programming on graphics engines
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Divergence Analysis and Optimizations
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Linear-time Modeling of Program Working Set in Shared Cache
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
A performance analysis framework for identifying potential benefits in GPGPU applications
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
A unified optimizing compiler framework for different GPGPU architectures
ACM Transactions on Architecture and Code Optimization (TACO)
Compiling a high-level language for GPUs: (via language support for architectures and compilers)
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Improving performance of OpenCL on CPUs
CC'12 Proceedings of the 21st international conference on Compiler Construction
Automatic restructuring of GPU kernels for exploiting inter-thread data locality
CC'12 Proceedings of the 21st international conference on Compiler Construction
Portable mapping of data parallel programs to OpenCL for heterogeneous systems
CGO '13 Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation
Proceedings of Workshop on General Purpose Processing Using GPUs
Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Proceedings of Workshop on General Purpose Processing Using GPUs
Hi-index | 0.00 |
OpenCL has become the de-facto data parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed with the goal of guaranteeing program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device. In this paper we consider a data parallel compiler transformation --- thread-coarsening --- and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters giving over 43,000 different experiments. We achieve speedups over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.