The Jalapeño dynamic optimizing compiler for Java
JAVA '99 Proceedings of the ACM 1999 conference on Java Grande
Code Generation for a One-Register Machine
Journal of the ACM (JACM)
Code selection for media processors with SIMD instructions
DATE '00 Proceedings of the conference on Design, automation and test in Europe
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Compilation techniques for multimedia processors
International Journal of Parallel Programming - Special issue on instruction-level parallelism and parallelizing compilation, Part 1
Register-sensitive selection, duplication, and sequencing of instructions
ICS '01 Proceedings of the 15th international conference on Supercomputing
Optimizing compilers for modern architectures: a dependence-based approach
Optimizing compilers for modern architectures: a dependence-based approach
Increasing and Detecting Memory Address Congruence
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Soot - a Java bytecode optimization framework
CASCON '99 Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative research
Optimal code generation for expression trees
STOC '75 Proceedings of seventh annual ACM symposium on Theory of computing
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Generation of permutations for SIMD processors
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Exploiting Vector Parallelism in Software Pipelined Loops
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Compiler optimizations for processors with SIMD instructions
Software—Practice & Experience
Near-optimal instruction selection on dags
Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization
Efficient SIMD code generation for irregular kernels
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
SIMD defragmenter: efficient ILP realization on data-parallel architectures
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
A compiler framework for extracting superword level parallelism
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
Dynamic compilation of data-parallel kernels for vector processors
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Hybrid type legalization for a sparse SIMD instruction set
ACM Transactions on Architecture and Code Optimization (TACO)
Hi-index | 0.00 |
Accelerating program performance via SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, VSE, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and open-source compilers fall short of using the full potential of vector units, and only generate vector code for simple innermost loops. In this paper, we present the design and implementation of anauto-vectorization framework in the back-end of a dynamic compiler that not only generates optimized vector code but is also well integrated with the instruction scheduler and register allocator. The framework includes a novel{\em compile-time efficient dynamic programming-based} vector instruction selection algorithm for straight-line code that expands opportunities for vectorization in the following ways: (1) {\em scalar packing} explores opportunities of packing multiple scalar variables into short vectors, (2)judicious use of {\em shuffle} and {\em horizontal} vector operations, when possible, and (3) {\em algebraic reassociation} expands opportunities for vectorization by algebraic simplification. We report performance results on the impact of auto-vectorization on a set of standard numerical benchmarks using the Jikes RVM dynamic compilation environment. Our results show performance improvement of up to 57.71\% on an Intel Xeon processor, compared tonon-vectorized execution, with a modest increase in compile-time in the range from 0.87\% to 9.992\%. An investigation of the SIMD parallelization performed by v11.1 of the Intel Fortran Compiler (IFC) on three benchmarks shows that our system achieves speedup with vectorization in all three cases and IFC does not. Finally, a comparison of our approach with an implementation of the Super word Level Parallelization (SLP) algorithm from~\cite{larsen00}, shows that our approach yields a performance improvement of up to 13.78\% relative to SLP.