Whole-function vectorization

Authors:
Ralf Karrenberg;Sebastian Hack
Affiliations:
Saarland University;Saarland University
Venue:
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Year:
2011


Abstract

Data-parallel programming languages are an important component in today's parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and "general-purpose" languages like CUDA or OpenCL. Current implementations of those languages on CPUs rely solely on multi-threading to implement parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction sets of those processors (like Intel's SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: first, incorporated in a compiler for a domain-specific language used in real-time ray tracing; second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels.
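
To make the idea of vectorizing a whole function (rather than a single loop body) more concrete, here is a minimal sketch of how a scalar function with a branch might be turned into one that processes several inputs at once. It is an illustration only and not the transformation described in the paper: the names `scalar_kernel`, `vector_kernel`, `select`, and `WIDTH` are hypothetical, the SIMD register is emulated with a plain Python list, and the width of 4 lanes is an assumption. The branch is replaced by a mask and a lane-wise blend, which is one common way to handle control flow in SIMD code.

```python
WIDTH = 4  # assumed SIMD width for this sketch (hypothetical)


def scalar_kernel(x):
    """Scalar function with control flow, written for one element."""
    if x > 0.0:
        return x * 2.0
    return -x


def select(mask, a, b):
    """Lane-wise blend: take a[i] where mask[i] is true, else b[i]."""
    return [ai if m else bi for m, ai, bi in zip(mask, a, b)]


def vector_kernel(xs):
    """Hand-vectorized counterpart that handles WIDTH inputs per call."""
    assert len(xs) == WIDTH
    mask = [x > 0.0 for x in xs]       # branch condition becomes a mask
    then_val = [x * 2.0 for x in xs]   # 'then' side, computed for all lanes
    else_val = [-x for x in xs]        # 'else' side, computed for all lanes
    # The blend takes the place of the value merge at the control-flow join.
    return select(mask, then_val, else_val)


inputs = [1.5, -2.0, 0.0, 3.0]
# Each lane of the vector version must agree with the scalar version.
assert vector_kernel(inputs) == [scalar_kernel(x) for x in inputs]
```

Because both sides of the branch are evaluated for every lane, lanes that diverge cost extra work, which is one plausible reason some kernels could end up slower than their scalar versions.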