Generation of permutations for SIMD processors

Authors:
Alexei Kudriavtsev;Peter Kogge
Affiliations:
University of Notre Dame;University of Notre Dame
Venue:
LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
Year:
2005

Abstract

Short vector (SIMD) instructions are useful in signal processing, multimedia, and scientific applications. They offer higher performance, lower energy consumption, and better resource utilization. However, compilers still lack good support for SIMD instructions, and code often has to be written manually in assembly language or with compiler built-in functions. In some applications, higher parallelism could also be achieved if compilers inserted permutation instructions that reorder the data in registers. In this paper we describe how we create SIMD instructions from regular code and determine the ordering of individual operations within the SIMD instructions so as to minimize the number of permutation instructions. Individual memory operations are grouped into SIMD operations based on their effective addresses. The SIMD data flow graph is then constructed by following data dependences from the SIMD memory operations, and the orderings of operations are propagated from the SIMD memory operations into the graph. We also describe our approach to computing a decomposition of a given permutation into the permutation instructions of the target architecture. Experiments with our prototype compiler show that this approach scales well with the number of operations in SIMD instructions (SIMD width) and can be used to compile a number of important kernels, achieving up to 35% speedup.
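To make the permutation-decomposition idea concrete, here is a minimal Python sketch. It is not the authors' algorithm: it is a plain breadth-first search over a made-up instruction set, shown only to illustrate what "decomposing a permutation into the permutation instructions of the target architecture" can mean. The instruction names (`rot1`, `swap_pairs`, `swap_halves`), the 4-lane width, and the functions `apply_perm` and `decompose` are all assumptions introduced for this example.

```python
from collections import deque

# Hypothetical single-register permutation instructions for a 4-lane SIMD unit.
# Each tuple p means: output lane i takes its value from input lane p[i].
INSTRUCTIONS = {
    "rot1": (1, 2, 3, 0),         # rotate lanes left by one
    "swap_pairs": (1, 0, 3, 2),   # swap neighbours within each pair
    "swap_halves": (2, 3, 0, 1),  # swap the low and high halves
}


def apply_perm(perm, lanes):
    """Return the lane contents after executing one permutation instruction."""
    return tuple(lanes[src] for src in perm)


def decompose(target, instructions=INSTRUCTIONS):
    """Find a shortest instruction sequence that turns the identity ordering
    into `target`, or return None if the instruction set cannot produce it.

    Breadth-first search guarantees minimal length, but it explores up to n!
    orderings, so this is only practical for small SIMD widths.
    """
    target = tuple(target)
    identity = tuple(range(len(target)))
    queue = deque([(identity, [])])
    seen = {identity}
    while queue:
        lanes, seq = queue.popleft()
        if lanes == target:
            return seq
        for name, perm in instructions.items():
            nxt = apply_perm(perm, lanes)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + [name]))
    return None  # target is outside the group generated by the instructions


if __name__ == "__main__":
    # Full lane reversal needs two of the hypothetical instructions.
    print(decompose((3, 2, 1, 0)))
    # A lone swap of lanes 0 and 1 is not reachable with this instruction set.
    print(decompose((1, 0, 2, 3)))
```

The search returns `None` when the requested ordering cannot be built from the available instructions. A real instruction set would typically also include two-source shuffles, which this single-register sketch does not model.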