In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. The algorithms are built on the four-step or six-step FFT algorithms. Because they use cyclic distribution, all-to-all communication takes place only once. Moreover, both the input data and the output data are in natural order. We also show that the suitability of a parallel FFT algorithm is machine-dependent, because the architecture of the processor elements differs among distributed-memory parallel computers. We report experimental results for 2^p3^q5^r-point FFTs on two distributed-memory parallel computers, the HITACHI SR2201 and the IBM SP2. We obtained about 130 GFLOPS on a 1024-PE HITACHI SR2201 and about 1.25 GFLOPS on a 32-PE IBM SP2.
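The four-step decomposition named in the abstract can be illustrated with a minimal sequential sketch. It is not the authors' parallel implementation. It assumes a length N = n1 * n2, uses a naive O(n^2) DFT as a stand-in for the radix-2, 3 and 5 kernels, and does not model the cyclic distribution or the all-to-all communication. The names `dft`, `four_step_fft` and `has_only_factors_2_3_5` are hypothetical helpers introduced here for illustration.

```python
import cmath


def has_only_factors_2_3_5(n):
    """Return True if n can be written as 2^p * 3^q * 5^r with p, q, r >= 0."""
    if n < 1:
        return False
    for prime in (2, 3, 5):
        while n % prime == 0:
            n //= prime
    return n == 1


def dft(x):
    """Naive DFT, used only as a placeholder for an optimized radix kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Four-step FFT of length n1 * n2; input and output are in natural order.

    Index mapping (an assumption of this sketch): j = j1 + j2 * n1 on input,
    k = k2 + k1 * n2 on output.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: n1 independent DFTs of length n2 over the strided
    # subsequences x[j1], x[j1 + n1], x[j1 + 2 * n1], ...
    rows = [dft(x[j1::n1]) for j1 in range(n1)]

    # Step 2: multiply element (j1, k2) by the twiddle factor w_N^(j1 * k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Steps 3 and 4: n2 independent DFTs of length n1 down the columns,
    # with results written straight to their natural-order positions
    # (this takes the place of an explicit final transpose).
    out = [0j] * n
    for k2 in range(n2):
        column = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + k1 * n2] = column[k1]
    return out


if __name__ == "__main__":
    data = [complex(i % 7, -(i % 3)) for i in range(30)]  # 30 = 2 * 3 * 5
    reference = dft(data)
    result = four_step_fft(data, 5, 6)
    print(max(abs(a - b) for a, b in zip(reference, result)))
```

In this sketch the two batches of short DFTs are mutually independent, which is what makes the decomposition attractive for distributing work; how the data is actually laid out across processor elements is left to the paper.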