Signal processing algorithms
Introduction to parallel computing: design and analysis of algorithms
A high performance parallel algorithm for 1-D FFT
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Optimization of All-to-All Communication on the Blue Gene/L Supercomputer
ICPP '08 Proceedings of the 2008 37th International Conference on Parallel Processing
Overview of the Blue Gene/L system architecture
IBM Journal of Research and Development
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
We analyze the bottlenecks in the parallel FFT algorithm and describe optimizations carried out for the algorithm on the Blue Gene/L supercomputer. We identified three avenues for improving the performance of the algorithm: single-node FFT performance, Alltoall collective performance, and overlap of computation and communication. Performance at all these levels has been optimized using the double-hummer intrinsics of the Blue Gene/L CPU, careful ordering and synchronization of messages in Alltoall communications, and suitable interleaving of message exchanges with computations. Using these optimizations, we obtained a 20% performance improvement over the baseline version on the 64-rack Blue Gene/L system. We give a brief overview of the Alltoall optimizations, describe our computation-communication overlap strategy, and present results for strong scaling and weak scaling of parallel FFT on Blue Gene/L. We also discuss the fundamental limits to scaling of the parallel transpose algorithm for computing FFT.
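For readers unfamiliar with the transpose algorithm mentioned above, the following is a minimal serial sketch of the general technique, not the authors' Blue Gene/L implementation. It factors a length-N transform as N = n1 * n2, and the in-memory transpose in step 3 stands in for the Alltoall exchange that a distributed run would perform. The function names `dft` and `transpose_fft` and the parameters `n1` and `n2` are hypothetical, and a naive O(n^2) DFT is assumed in place of an optimized single-node FFT.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT; a stand-in (assumption) for a tuned single-node FFT."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose_fft(x, n1, n2):
    """1-D DFT of length n1*n2 via the transpose method (serial simulation)."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 independent length-n1 transforms over the strided columns
    # of x viewed as an n1-by-n2 row-major matrix.
    b = [dft([x[n2 * r + c] for r in range(n1)]) for c in range(n2)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i*c*k1/n).
    for c in range(n2):
        for k1 in range(n1):
            b[c][k1] *= cmath.exp(-2j * cmath.pi * c * k1 / n)
    # Step 3: transpose. In a distributed run this is the all-to-all
    # exchange; here it is only an in-memory reshuffle.
    t = [[b[c][k1] for c in range(n2)] for k1 in range(n1)]
    # Step 4: n1 independent length-n2 transforms.
    d = [dft(row) for row in t]
    # Step 5: write results in natural order, k = k1 + n1*k2.
    out = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            out[k1 + n1 * k2] = d[k1][k2]
    return out
```

In this sketch, steps 1 and 4 are the purely local computations and step 3 is the only stage that would move data between nodes, which is why single-node FFT speed, the Alltoall collective, and their overlap are the natural tuning points.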