Automatic SIMD vectorization of fast Fourier transforms for the Larrabee and AVX instruction sets

Authors:
Daniel S. McFarlin;Volodymyr Arbatov;Franz Franchetti;Markus Püschel
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;Carnegie Mellon University, Pittsburgh, PA, USA;ETH Zurich, Zurich, Switzerland
Venue:
Proceedings of the international conference on Supercomputing
Year:
2011

Abstract

The well-known shift to parallelism in CPUs is often associated with multicores. However, another trend is equally salient: the increasing parallelism in per-core single-instruction multiple-data (SIMD) vector units. Intel's SSE and IBM's VMX (compatible with AltiVec) both offer 4-way (single-precision) floating point, but the recent Intel instruction sets AVX and Larrabee (LRB) offer 8-way and 16-way, respectively. Compilation and optimization for vector extensions is hard, and the speed-up achievable with vectorizing compilers is often small compared to hand-optimization using intrinsic function interfaces. Unfortunately, the complexity of these intrinsics interfaces increases considerably with the vector length, making hand-optimization a nightmare. In this paper, we present a peephole-based vectorization system that takes as input the vector instruction semantics and outputs a library of basic data reorganization blocks, such as small transpositions and perfect shuffles, that are needed in a variety of high-performance computing applications. We evaluate the system by generating the blocks needed by the program generator Spiral for vectorized fast Fourier transforms (FFTs). With the generated FFTs we achieve a vectorization speed-up of 5.5–6.5 for 8-way AVX and 10–12.5 for 16-way LRB. For the latter, instruction counts are used since no timing information is available. The combination of the proposed system and Spiral thus automates the production of high-performance FFTs for current and future vector architectures.
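
To make the "basic data reorganization blocks" concrete, the following is a minimal sketch of a perfect shuffle and a 4x4 transposition built from two-input interleave operations. It is a scalar C model written for illustration only, not code produced by the system described in the abstract. The type `vec`, the width `W`, and the functions `interleave_lo`, `interleave_hi`, `perfect_shuffle`, and `transpose4` are all hypothetical names; the two interleave operations are assumed to behave like the low/high unpack instructions found in SSE-style instruction sets.

```c
#include <assert.h>

#define W 4 /* assumed vector width: 4-way, as for SSE single precision */

/* Scalar stand-in for one vector register holding W floats (hypothetical). */
typedef struct { float v[W]; } vec;

/* Interleave the low halves of a and b: {a0, b0, a1, b1}. */
static vec interleave_lo(vec a, vec b) {
    vec r;
    for (int i = 0; i < W / 2; i++) {
        r.v[2 * i] = a.v[i];
        r.v[2 * i + 1] = b.v[i];
    }
    return r;
}

/* Interleave the high halves of a and b: {a2, b2, a3, b3}. */
static vec interleave_hi(vec a, vec b) {
    vec r;
    for (int i = 0; i < W / 2; i++) {
        r.v[2 * i] = a.v[W / 2 + i];
        r.v[2 * i + 1] = b.v[W / 2 + i];
    }
    return r;
}

/* Perfect shuffle of the 2*W elements held in registers a and b:
   {a0..a3, b0..b3} -> {a0, b0, a1, b1 | a2, b2, a3, b3}. */
static void perfect_shuffle(vec a, vec b, vec out[2]) {
    out[0] = interleave_lo(a, b);
    out[1] = interleave_hi(a, b);
}

/* Transpose a 4x4 block stored as four row registers using two rounds
   of interleaves (eight two-input operations in total). */
static void transpose4(const vec in[4], vec out[4]) {
    assert(W == 4); /* this particular sequence is specific to 4x4 blocks */
    /* Round 1: pair rows 0 and 2, and rows 1 and 3. */
    vec t0 = interleave_lo(in[0], in[2]); /* a0 c0 a1 c1 */
    vec t1 = interleave_hi(in[0], in[2]); /* a2 c2 a3 c3 */
    vec t2 = interleave_lo(in[1], in[3]); /* b0 d0 b1 d1 */
    vec t3 = interleave_hi(in[1], in[3]); /* b2 d2 b3 d3 */
    /* Round 2: interleaving the intermediates yields the columns. */
    out[0] = interleave_lo(t0, t2); /* a0 b0 c0 d0 */
    out[1] = interleave_hi(t0, t2); /* a1 b1 c1 d1 */
    out[2] = interleave_lo(t1, t3); /* a2 b2 c2 d2 */
    out[3] = interleave_hi(t1, t3); /* a3 b3 c3 d3 */
}
```

On real hardware, each call to `interleave_lo` or `interleave_hi` would correspond to a single vector instruction. For wider vectors such as 8-way AVX or 16-way LRB, the available shuffle instructions differ and the shortest sequences are harder to find by hand, which is the difficulty the abstract points to.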