FFTs in external or hierarchical memory
The Journal of Supercomputing
Computational frameworks for the fast Fourier transform
Computational frameworks for the fast Fourier transform
The Future Fast Fourier Transform?
SIAM Journal on Scientific Computing
A high performance parallel algorithm for 1-D FFT
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Automatic Performance Tuning in the UHFFT Library
ICCS '01 Proceedings of the International Conference on Computational Sciences-Part I
A Blocking Algorithm for FFT on Cache-Based Processors
HPCN Europe 2001 Proceedings of the 9th International Conference on High-Performance Computing and Networking
High Performance Communication using a Commodity Network for Cluster Systems
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
The Fastest Fourier Transform in the West
The Fastest Fourier Transform in the West
High Performance FFT Algorithms for Cache-Coherent Multiprocessors
International Journal of High Performance Computing Applications
FFT algorithms for vector computers
Parallel Computing
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
In this paper, we propose a blocking algorithm for a parallel one-dimensional fast Fourier transform (FFT) on clusters of PCs. Our proposed parallel FFT algorithm is based on the six-step FFT algorithm. The six-step FFT algorithm can be altered into a block nine-step FFT algorithm to reduce the number of cache misses. The block nine-step FFT algorithm improves performance by utilizing the cache memory effectively. We use the block nine-step FFT algorithm to design the parallel one-dimensional FFT algorithm. In our proposed parallel FFT algorithm, since we use cyclic distribution, all-to-all communication is required only once. Moreover, the input data and output data are both can be given in natural order. We successfully achieved performance of over 1.3 GFLOPS on an 8-node dual Pentium III 1 GHz PC SMP cluster.