In this paper, we propose a high-performance parallel FFT algorithm based on a multidimensional formulation. We use it to compute a commonly encountered FFT-based kernel on a distributed-memory parallel machine, the IBM scalable parallel system SP1. The kernel consists of a forward FFT of an input sequence, multiplication of the transformed data by a coefficient array, and an inverse FFT of the result. We show that the multidimensional formulation reduces communication costs and improves single-node performance by making effective use of the node's memory system. Our implementation of this kernel on the IBM SP1 achieved 1.25 GFLOPS on a 64-node machine.
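The abstract does not spell out the factorization it uses, so the following is only a minimal serial sketch of one standard multidimensional formulation (the "four-step" decomposition of a length n1*n2 transform into a two-dimensional one), not the paper's implementation. The function names `dft`, `fft_multidim`, and `fft_kernel` are hypothetical, and a naive DFT stands in for the short sub-transforms to keep the example self-contained. On a distributed-memory machine, one would presumably assign rows to nodes so that the row transforms are local and only the column step requires communication, but that mapping is an assumption here.

```python
import cmath


def dft(x, sign=-1):
    """Naive O(n^2) DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft_multidim(x, n1, n2, sign=-1):
    """Length n1*n2 DFT computed through a two-dimensional (n1 x n2) formulation.

    Input index  j = j1 + n1*j2, output index k = k2 + n2*k1.
    """
    n = n1 * n2
    assert len(x) == n
    # Step 1: n1 independent DFTs of length n2 (row j1 holds x[j1], x[j1+n1], ...).
    rows = [dft(x[j1::n1], sign) for j1 in range(n1)]
    # Step 2: multiply element (j1, k2) by the twiddle factor w_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(sign * 2j * cmath.pi * j1 * k2 / n)
    # Step 3: n2 independent DFTs of length n1 down the columns,
    # written out in transposed order.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([rows[j1][k2] for j1 in range(n1)], sign)
        for k1 in range(n1):
            out[k2 + n2 * k1] = col[k1]
    return out


def fft_kernel(x, coeffs, n1, n2):
    """Forward FFT, pointwise multiply by a coefficient array, inverse FFT."""
    n = n1 * n2
    spectrum = fft_multidim(x, n1, n2, sign=-1)
    product = [s * c for s, c in zip(spectrum, coeffs)]
    # The inverse transform uses the opposite sign and a 1/n scaling.
    return [v / n for v in fft_multidim(product, n1, n2, sign=+1)]
```

With a coefficient array of all ones, `fft_kernel` should return the input sequence up to rounding error; with `coeffs` set to the DFT of a second sequence, it computes their circular convolution.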