We have developed an efficient implementation of the 2D fast Fourier transform (FFT) on a new very long instruction word (VLIW) programmable mediaprocessor. By exploiting instruction-level parallelism and a multimedia instruction set, our radix-4 Cooley-Tukey algorithm maps the FFT computation optimally onto the processing resources of the Hitachi/Equator MAP mediaprocessor. Processing several columns in parallel during the column-wise stage of the row-column decomposition also gives more efficient data I/O and lower data transfer time than traditional implementations. A programmable direct memory access (DMA) engine and a double-buffering scheme in the data cache allow computation and data transfer to proceed in parallel. Our implementation computes a 512 × 512-point 2D complex FFT in 22.4 ms total execution time, which is faster than previous single-chip programmable or dedicated solutions. Implementations on two other mediaprocessors, the TriMedia TM1100 and the BOPS ManArray, illustrate the importance of the instruction set architecture for achieving high performance, and show a trend in newer mediaprocessors toward data I/O becoming the limiting factor for 2D FFT performance.
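The row-column decomposition and the radix-4 butterfly mentioned in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' mediaprocessor code: the function names (`fft_radix4`, `fft2d_row_column`, `dft2d_direct`) and the `cols_per_block` parameter are hypothetical, the blocked column loop only imitates the idea of handling several columns per pass, and the DMA engine, double buffering, and multimedia instructions are not modeled at all.

```python
import cmath


def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    # Transform the four interleaved sub-sequences x[4m + r], r = 0..3.
    f0, f1, f2, f3 = (fft_radix4(x[r::4]) for r in range(4))
    quarter = n // 4
    out = [0j] * n
    for q in range(quarter):
        # Twiddle factors W_n^q, W_n^(2q), W_n^(3q).
        w1 = cmath.exp(-2j * cmath.pi * q / n)
        w2 = w1 * w1
        w3 = w2 * w1
        t0, t1, t2, t3 = f0[q], w1 * f1[q], w2 * f2[q], w3 * f3[q]
        # Radix-4 butterfly: the only multipliers are +1, -1, +j, -j.
        out[q] = t0 + t1 + t2 + t3
        out[q + quarter] = t0 - 1j * t1 - t2 + 1j * t3
        out[q + 2 * quarter] = t0 - t1 + t2 - t3
        out[q + 3 * quarter] = t0 + 1j * t1 - t2 - 1j * t3
    return out


def fft2d_row_column(image, cols_per_block=4):
    """2D FFT by row-column decomposition, taking columns a block at a time."""
    rows, cols = len(image), len(image[0])
    # Stage 1: 1D FFT along every row.
    stage1 = [fft_radix4(list(row)) for row in image]
    # Stage 2: 1D FFT along every column, gathering a strip of adjacent
    # columns per pass (a stand-in for fetching several columns together).
    result = [[0j] * cols for _ in range(rows)]
    for c0 in range(0, cols, cols_per_block):
        c1 = min(c0 + cols_per_block, cols)
        strip = [[stage1[r][c] for r in range(rows)] for c in range(c0, c1)]
        for offset, column in enumerate(strip):
            transformed = fft_radix4(column)
            for r in range(rows):
                result[r][c0 + offset] = transformed[r]
    return result


def dft2d_direct(image):
    """Direct O(N^4) 2D DFT, used only to check the fast version."""
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(
                image[r][c]
                * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                for r in range(rows)
                for c in range(cols)
            )
            for v in range(cols)
        ]
        for u in range(rows)
    ]
```

Comparing `fft2d_row_column` against `dft2d_direct` on a small input (for example 4 × 4 or 16 × 16) is a simple way to confirm that the two stages compose into the full 2D transform. Grouping adjacent columns matters on real hardware because column elements are strided in row-major memory; in this sketch the grouping changes only the loop structure, not the result.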