Automatic OpenCL work-group size selection for multicore CPUs

Authors:
Sangmin Seo;Jun Lee;Gangwon Jo;Jaejin Lee
Affiliations:
ManyCoreSoft, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea;Seoul National University, Seoul, South Korea
Venue:
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Year:
2013

Abstract

In this paper, we address the effect of the work-group size on the performance of OpenCL kernels. We propose a profiling-based algorithm that finds a work-group size that performs well on the target multicore CPU architecture. Our algorithm reduces misses in the private L1 data cache and balances the load across cores. It uses the polyhedral model to estimate the working-set size and the number of cache misses for a parameterized work-group size of the OpenCL kernel. Based on the profiling information, it heuristically searches the space of parameterized work-group sizes. Our virtually extended index space increases the probability of finding a better work-group size. We implement our work-group size selection algorithm as a development tool that consists of a code generator and a search library. The code generator extracts the polytope of each memory reference from the kernel code and generates a function that simplifies the polytopes using run-time information and invokes search library routines. The search library calculates the working-set size from the polytopes and finds a suitable work-group size. We evaluate our approach using 31 OpenCL kernels on four different multicore CPUs and compare its accuracy and search time to those of an exhaustive search. Experimental results show that our tool is, on average, 1566 times faster than the exhaustive search and selects a work-group size whose performance is the same as or comparable to that of the size found by the exhaustive search.
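
The two goals named in the abstract, keeping a work-group's working set inside the private L1 data cache and balancing work-groups across cores, can be illustrated with a toy sketch. It is not the paper's algorithm: there is no polyhedral analysis, profiling, or virtually extended index space here. The kernel is assumed to be a hypothetical 1D three-point stencil, the working-set formula is written by hand for that kernel, and all function names and parameters are made up for illustration.

```python
def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def working_set_bytes(wg_size, elem_bytes=4, halo=1):
    """Bytes touched by one work-group of a hypothetical 1D stencil kernel.

    Assumption: work-item i reads A[i-halo .. i+halo] and writes B[i], so
    wg_size consecutive work-items touch wg_size + 2*halo elements of A
    and wg_size elements of B.
    """
    return (2 * wg_size + 2 * halo) * elem_bytes


def load_imbalance(num_groups, num_cores):
    """Fraction of core slots left idle when groups run in rounds of num_cores."""
    rounds = -(-num_groups // num_cores)  # ceiling division
    return 1.0 - num_groups / (rounds * num_cores)


def select_work_group_size(global_size, num_cores, l1_bytes):
    """Pick a work-group size for a 1D index space (illustrative heuristic).

    Candidates are divisors of global_size, so the index space splits into
    whole work-groups. Sizes whose working set exceeds l1_bytes are dropped.
    Among the rest, prefer the lowest load imbalance, then the largest size.
    """
    candidates = []
    for wg in divisors(global_size):
        if working_set_bytes(wg) > l1_bytes:
            continue
        groups = global_size // wg
        # Tuples sort by imbalance first, then by -wg (larger size wins ties).
        candidates.append((load_imbalance(groups, num_cores), -wg, wg))
    if not candidates:
        return 1  # fall back to the smallest legal work-group size
    return min(candidates)[2]


if __name__ == "__main__":
    # Assumed machine: 4 cores, 32 KiB private L1 data cache per core.
    print(select_work_group_size(4096, num_cores=4, l1_bytes=32 * 1024))
```

Under these assumptions, a size of 2048 fits in the cache but yields only two work-groups for four cores, so the sketch settles on 1024, which gives one work-group per core. A real kernel would need its working set derived from its actual memory references, which is the part the paper attributes to the polyhedral model.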