A large-scale cross-architecture evaluation of thread-coarsening

Authors:
Alberto Magni;Christophe Dubach;Michael F. P. O'Boyle
Affiliations:
University of Edinburgh;University of Edinburgh;University of Edinburgh
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

OpenCL has become the de facto data-parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed to guarantee program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device. In this paper we consider a data-parallel compiler transformation, thread-coarsening, and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters, giving over 43,000 different experiments. We achieve speedups of over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.
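
As a rough illustration of what thread-coarsening means, the sketch below simulates a data-parallel kernel launch serially in Python. It is not taken from the paper: the kernel (`square_kernel`), the `launch` helper, the coarsening factor, and the contiguous index mapping are all assumptions chosen for clarity, and the LLVM-based compiler described in the abstract may map work differently. The idea shown is only that each coarsened work-item performs the work of several original work-items, so fewer work-items are launched for the same result.

```python
FACTOR = 2  # hypothetical coarsening factor, chosen for illustration


def square_kernel(gid, src, dst):
    """Original kernel body: one work-item squares one element."""
    dst[gid] = src[gid] * src[gid]


def square_kernel_coarsened(gid, src, dst, factor=FACTOR):
    """Coarsened kernel body: one work-item covers `factor` elements.

    The contiguous mapping (gid * factor + k) is an assumption; a real
    compiler could use a strided mapping instead.
    """
    for k in range(factor):
        idx = gid * factor + k
        dst[idx] = src[idx] * src[idx]


def launch(kernel, global_size, src, dst):
    """Serial stand-in for launching `global_size` work-items."""
    for gid in range(global_size):
        kernel(gid, src, dst)


def run_both(n=8, factor=FACTOR):
    """Run both versions; assumes n is divisible by factor."""
    src = [float(i) for i in range(n)]
    ref = [0.0] * n
    coarse = [0.0] * n
    launch(square_kernel, n, src, ref)
    launch(
        lambda gid, s, d: square_kernel_coarsened(gid, s, d, factor),
        n // factor,  # fewer work-items, each doing more work
        src,
        coarse,
    )
    return ref, coarse
```

Calling `run_both()` returns two lists that should be equal, since coarsening changes how work is divided among work-items but not what is computed.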