Compiling C/C++ SIMD Extensions for Function and Loop Vectorizaion on Multicore-SIMD Processors

Authors:
Xinmin Tian;Hideki Saito;Milind Girkar;Serguei V. Preis;Sergey S. Kozhukhov;Aleksei G. Cherkasov;Clark Nelson;Nikolay Panchenko;Robert Geva
Affiliations:
-;-;-;-;-;-;-;-;-
Venue:
IPDPSW '12 Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Year:
2012

Citing 0
Cited 1

Extending OpenMP* with vector constructs for modern multicore SIMD architectures

IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World

Quantified Score

Hi-index	0.00

Visualization

Abstract

SIMD vectorization has received significant attention in the past decade as an important method to accelerate scientific applications, media and embedded applications on SIMD architectures such as Intel® SSE, AVX, and IBM* AltiVec. However, most of the focus has been directed at loops, effectively executing their iterations on multiple SIMD lanes concurrently relying upon program hints and compiler analysis. This paper presents a set of new C/C++ high-level vector extensions for SIMD programming, and the Intel® C++ product compiler that is extended to translate these vector extensions and produce optimized SIMD instruction sequences of vectorized functions and loops. For a function, our main idea is to vectorize the entire function for callers instead of just vectorizing loops (if any) inside the function. It poses the challenge of dealing with complicated control-flow in the function body, and matching caller and callee for SIMD vector calls while vectorizing caller functions (or loops) and callee functions. Our compilation methods for automatically compiling vector extensions are described. We present performance results of several non-trivial visual computing, computational, and simulation workloads, utilizing SIMD units through the vector extensions on Intel® Multicore 128-bit SIMD processors, and we show that significant SIMD speedups (3.07x to 4.69x) are achieved over the serial execution.