A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
A tile selection algorithm for data locality and cache interference
ICS '99 Proceedings of the 13th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
A Comparison of Compiler Tiling Algorithms
CC '99 Proceedings of the 8th International Conference on Compiler Construction, Held as Part of the European Joint Conferences on the Theory and Practice of Software, ETAPS'99
A Quantitative Analysis of Tile Size Selection Algorithms
The Journal of Supercomputing
Fast data-locality profiling of native execution
SIGMETRICS '05 Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Positivity, posynomials and tile size selection
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
Automatic creation of tile size selection models
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
An OpenCL framework for heterogeneous multicores with local memory
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
All-window profiling and composable models of cache sharing
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Computer Architecture, Fifth Edition: A Quantitative Approach
Computer Architecture, Fifth Edition: A Quantitative Approach
Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Linear-time Modeling of Program Working Set in Shared Cache
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence
PACT '11 Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques
Analytical bounds for optimal tile size selection
CC'12 Proceedings of the 21st international conference on Compiler Construction
SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters
Proceedings of the 26th ACM international conference on Supercomputing
Performance characterization of the NAS Parallel Benchmarks in OpenCL
IISWC '11 Proceedings of the 2011 IEEE International Symposium on Workload Characterization
Hi-index | 0.00 |
In this paper, we address the effect of the work-group size on the performance of OpenCL kernels. We propose a profiling-based algorithm that finds a good work-group size, in terms of performance, for the target multicore CPU architecture. Our algorithm reduces misses in the private L1 data cache and achieves load balancing between cores. It exploits the polyhedral model to estimate the working-set size and the number of cache misses for a parameterized work-group size of the OpenCL kernel. Based on the profiling information, it heuristically searches the space of parameterized work-group sizes. Our virtually-extended index space helps to increase the probability to find a better work-group size. We implement our work-group size selection algorithm as a development tool that consists of a code generator and a search library. The code generator extracts the polytope of each memory reference from the kernel code and generates a function that simplifies polytopes using the run-time information and invokes search library routines. The search library calculates the working-set size using the polytopes and finds a proper work-group size. We evaluate our approach using 31 OpenCL kernels on four different multicore CPUs. We compare its accuracy and search time to those of an exhaustive search method. Experimental results show that our tool is, on average, 1566 times faster than the exhaustive search and selects a work-group size whose performance is the same as or comparable to that of the exhaustive search.