Improving performance of OpenCL on CPUs

Authors:
Ralf Karrenberg;Sebastian Hack
Affiliations:
Saarland University, Germany;Saarland University, Germany
Venue:
CC'12 Proceedings of the 21st international conference on Compiler Construction
Year:
2012

Abstract

Data-parallel languages like OpenCL and CUDA are an important means of exploiting the computational power of today's computing devices. In this paper, we deal with two aspects of implementing such languages on CPUs. First, we present a static analysis and an accompanying optimization that exclude code regions from control-flow to data-flow conversion, the technique commonly used to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization. We evaluate our techniques in a custom OpenCL CPU driver, which we compare against itself in different configurations and against proprietary implementations by AMD and Intel. We achieve an average speedup factor of 1.21 compared to naïve vectorization, and additional factors of 1.15 to 2.09 for suitable kernels due to the optimizations enabled by our analysis. Our best configuration achieves an average speedup factor of 2.5 against the Intel driver.
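To make the first point more concrete, the following is a minimal sketch of what control-flow to data-flow conversion does, and of why a code region might be excluded from it. It is not taken from the paper and does not reproduce the authors' analysis. The vector width `W`, the kernel body, and the names `scalar_kernel`, `vector_kernel`, and `use_gain` are all hypothetical, and Python lists stand in for vector registers.

```python
W = 4  # assumed vector width (work items per vector); illustrative only


def scalar_kernel(x, use_gain):
    """Reference semantics for a single work item (hypothetical kernel)."""
    if use_gain:      # same value for every work item: a uniform condition
        x = x * 2.0
    if x > 1.0:       # depends on per-item data: a varying condition
        return x - 1.0
    return -x


def vector_kernel(xs, use_gain):
    """Runs W work items in lockstep, one list element per vector lane."""
    assert len(xs) == W
    # Uniform branch: all lanes agree, so it can stay ordinary control flow.
    if use_gain:
        xs = [x * 2.0 for x in xs]
    # Varying branch: converted to data flow. The condition becomes a mask,
    # both sides are computed for all lanes, and the results are blended.
    mask = [x > 1.0 for x in xs]
    then_vals = [x - 1.0 for x in xs]
    else_vals = [-x for x in xs]
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


if __name__ == "__main__":
    data = [0.25, 0.75, 1.5, -3.0]
    for flag in (False, True):
        expected = [scalar_kernel(x, flag) for x in data]
        assert vector_kernel(data, flag) == expected
        print(flag, vector_kernel(data, flag))
```

In this sketch the branch on `use_gain` is left alone because its condition cannot differ between lanes, so only one side ever executes. The branch on `x > 1.0` pays for both sides plus a blend. Deciding which branches fall into the first group without running the program is the kind of question a static analysis would have to answer.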