In this paper we use a mathematical approach to automatically generate high-performance short vector code for the discrete Fourier transform (DFT). We represent the well-known Cooley-Tukey fast Fourier transform in a mathematical notation and formally derive a "short vector variant". Using this recursion, we generate a large number of different algorithms for a given DFT, represent them as formulas, and translate them into short vector code. We then present a dynamic programming method specific to vector code that searches the space of implementations for the fastest one on the given architecture. We implemented this approach as part of the SPIRAL library generator. On Pentium III and Pentium 4, our automatically generated SSE and SSE2 vector code compares favorably with the hand-tuned Intel vendor library.
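
To make "Cooley-Tukey in a mathematical notation" concrete, the sketch below checks the standard Kronecker-product form of the recursion, DFT_{mn} = (DFT_m ⊗ I_n) · T · (I_m ⊗ DFT_n) · L, where T is a diagonal twiddle matrix and L is a stride permutation. This is the textbook factorization, not the paper's short vector variant, and it builds dense matrices only for verification. The function names (`dft_matrix`, `stride_permutation`, `twiddle`, `cooley_tukey`) are hypothetical and chosen for illustration; NumPy is assumed.

```python
import numpy as np

def dft_matrix(n):
    """Dense n x n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def stride_permutation(m, n):
    """Permutation L with (L x)[j1*n + j2] = x[j2*m + j1]: gather at stride m."""
    size = m * n
    perm = np.zeros((size, size))
    for j1 in range(m):
        for j2 in range(n):
            perm[j1 * n + j2, j2 * m + j1] = 1.0
    return perm

def twiddle(m, n):
    """Diagonal matrix T whose entry at position j1*n + k2 is w^(j1*k2), w = exp(-2*pi*i/(m*n))."""
    exponents = np.outer(np.arange(m), np.arange(n)).reshape(-1)
    return np.diag(np.exp(-2j * np.pi * exponents / (m * n)))

def cooley_tukey(m, n):
    """One Cooley-Tukey step for DFT_{mn}, written as a product of four structured factors."""
    return (
        np.kron(dft_matrix(m), np.eye(n))    # DFT_m applied across blocks
        @ twiddle(m, n)                       # twiddle scaling
        @ np.kron(np.eye(m), dft_matrix(n))   # m independent DFT_n on contiguous blocks
        @ stride_permutation(m, n)            # reorder the input
    )

if __name__ == "__main__":
    x = np.random.default_rng(0).standard_normal(8)
    # Different (m, n) splits give different formulas for the same transform.
    for m, n in [(2, 4), (4, 2)]:
        print(m, n, np.allclose(cooley_tukey(m, n) @ x, np.fft.fft(x)))
```

Each choice of (m, n), applied recursively to the smaller DFTs, yields a different formula for the same transform. That set of formulas is the kind of space a search over implementations would explore, with measured runtime as the cost.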