Generating SIMD vectorized permutations

Authors:
Franz Franchetti; Markus Püschel
Affiliations:
Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA; Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA
Venue:
CC'08/ETAPS'08: Proceedings of the Joint European Conferences on Theory and Practice of Software, 17th International Conference on Compiler Construction
Year:
2008

Abstract

This paper introduces a method to generate efficient vectorized implementations of small stride permutations using only vector load and vector shuffle instructions. These permutations are crucial for high-performance numerical kernels, including the fast Fourier transform. Our generator takes as input only the specification of the target platform's SIMD vector ISA and the desired permutation. The basic idea underlying our generator is to model vector instructions as matrices and sequences of vector instructions as matrix formulas using the Kronecker product formalism. We design a rewriting system and a search mechanism that together apply matrix identities to generate matrix formulas that have vector structure and minimize a cost measure that we define. The resulting formula is then translated into the actual vector program for the specified permutation. For three important classes of permutations, we show that our method yields a solution with the minimal number of vector shuffles. Inserting the generated permutations into a fast Fourier transform yields a significant speedup.
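The central idea of the abstract, that a SIMD shuffle instruction can be written as a matrix and a permutation as a formula built from such matrices, can be illustrated with a small Python sketch. It is not the authors' generator. The vector length NU = 4, the stride-permutation convention (input read at stride m), the helper names (stride_perm, shuffle_matrix, kron, matvec), and the two lane patterns, modeled on the SSE instructions unpacklo_ps and unpackhi_ps, are all assumptions chosen for illustration. The sketch checks that two such shuffles together realize the stride permutation L^8_4 (interleaving two 4-element vectors), and shows two Kronecker constructs: I_2 ⊗ L^8_4 reuses the same two-shuffle kernel on independent data, while L^4_2 ⊗ I_4 moves only whole vectors.

```python
# Hypothetical sketch: SIMD shuffles modeled as 0/1 matrices.
# Not the paper's generator; all names and conventions here are assumptions.

NU = 4  # assumed vector length (e.g., four single-precision floats)


def stride_perm(m, n):
    """Matrix of the stride permutation L^{mn}_m under the assumed
    convention y[i*n + j] = x[j*m + i], i.e. the input is read at stride m."""
    size = m * n
    P = [[0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            P[i * n + j][j * m + i] = 1
    return P


def identity(k):
    """k x k identity matrix as a list of rows."""
    return [[int(r == c) for c in range(k)] for r in range(k)]


def kron(A, B):
    """Kronecker product A (x) B of matrices stored as lists of rows."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]


def matvec(M, x):
    """Matrix-vector product; for a permutation matrix this reorders x."""
    return [sum(c * v for c, v in zip(row, x)) for row in M]


def shuffle_matrix(pattern):
    """NU x 2NU matrix of a two-operand shuffle: output lane r takes element
    pattern[r] of the concatenated inputs (a0..a3, b0..b3)."""
    M = [[0] * (2 * NU) for _ in range(NU)]
    for r, src in enumerate(pattern):
        M[r][src] = 1
    return M


# Lane patterns modeled on SSE unpacklo_ps / unpackhi_ps.
UNPACK_LO = shuffle_matrix([0, 4, 1, 5])  # (a0, b0, a1, b1)
UNPACK_HI = shuffle_matrix([2, 6, 3, 7])  # (a2, b2, a3, b3)

# Stacking the two instruction matrices gives an 8 x 8 matrix equal to
# L^8_4, so this interleave costs two shuffles.
TWO_SHUFFLES = UNPACK_LO + UNPACK_HI
assert TWO_SHUFFLES == stride_perm(4, 2)

print(matvec(TWO_SHUFFLES, list(range(8))))
# [0, 4, 1, 5, 2, 6, 3, 7]

# I_2 (x) L^8_4: the same two-shuffle kernel applied to two independent pairs.
print(matvec(kron(identity(2), stride_perm(4, 2)), list(range(16))))
# [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]

# L^4_2 (x) I_NU: permutes whole NU-element vectors, so it can be realized
# with vector loads and stores alone, without any shuffle.
print(matvec(kron(stride_perm(2, 2), identity(NU)), list(range(16))))
# [0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15]
```

In this toy model, a natural cost for a formula is the number of shuffle matrices it contains: two for the interleave, four for I_2 ⊗ L^8_4, and zero for L^4_2 ⊗ I_4. The sketch only verifies hand-chosen factorizations; it performs no rewriting or search.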