Efficient Selection of Vector Instructions Using Dynamic Programming

Authors:
Rajkishore Barik;Jisheng Zhao;Vivek Sarkar
Affiliations:
-;-;-
Venue:
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2010

Citing 18
Cited 5

The Jalapeño dynamic optimizing compiler for Java

JAVA '99 Proceedings of the ACM 1999 conference on Java Grande
Code Generation for a One-Register Machine

Journal of the ACM (JACM)
Code selection for media processors with SIMD instructions

DATE '00 Proceedings of the conference on Design, automation and test in Europe
Exploiting superword level parallelism with multimedia instruction sets

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Compilation techniques for multimedia processors

International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
Register-sensitive selection, duplication, and sequencing of instructions

ICS '01 Proceedings of the 15th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach

Optimizing compilers for modern architectures: a dependence-based approach
MMX Technology Extension to the Intel Architecture

IEEE Micro
AltiVec Extension to PowerPC Accelerates Media Processing

IEEE Micro
Increasing and Detecting Memory Address Congruence

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Soot - a Java bytecode optimization framework

CASCON '99 Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research
Optimal code generation for expression trees

STOC '75 Proceedings of seventh annual ACM symposium on Theory of computing
Vectorization for SIMD architectures with alignment constraints

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Generation of permutations for SIMD processors

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Exploiting Vector Parallelism in Software Pipelined Loops

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiler optimizations for processors with SIMD instructions

Software—Practice & Experience
Near-optimal instruction selection on dags

Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization

Efficient SIMD code generation for irregular kernels

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
SIMD defragmenter: efficient ILP realization on data-parallel architectures

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
A compiler framework for extracting superword level parallelism

Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamic compilation of data-parallel kernels for vector processors

Proceedings of the Tenth International Symposium on Code Generation and Optimization
Hybrid type legalization for a sparse SIMD instruction set

ACM Transactions on Architecture and Code Optimization (TACO)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of anauto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel{\em compile-time efficient dynamic programming-based} vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) {\em scalar packing} explores opportunities of packing multiple scalar variables into short vectors, (2)judicious use of {\em shuffle} and {\em horizontal} vector operations, when possible, and (3) {\em algebraic reassociation} expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show performance improvement of up to 57.71\% on an Intel Xeon processor, compared tonon-vectorized execution, with a modest increase in compile-time in the range from 0.87\% to 9.992\%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases and IFC does not. Finally, a comparison of our approach with an implementation of the Super word Level Parallelization (SLP) algorithm from~\cite{larsen00}, shows that our approach yields a performance improvement of up to 13.78\% relative to SLP.