A transpose-free in-place SIMD optimized FFT

Authors:
James R. Geraci;Sharon M. Sacco
Affiliations:
MIT Lincoln Laboratory;MIT Lincoln Laboratory
Venue:
ACM Transactions on Architecture and Code Optimization (TACO)
Year:
2012

Abstract

A transpose-free in-place SIMD optimized algorithm for the computation of large FFTs is introduced and implemented on the Cell Broadband Engine. Six different FFT implementations of the algorithm using six different data movement methods are described. Their relative performance is compared for input sizes from 2^17 to 2^21 complex floating point samples. Large differences in performance are observed among even theoretically equivalent data movement patterns. All six implementations compare favorably with FFTW and other previous FFT implementations.