A framework for low-communication 1-D FFT

Authors:
Ping Tak Peter Tang;Jongsoo Park;Daehyun Kim;Vladimir Petrov
Affiliations:
Intel Corporation, Santa Clara, CA, USA;Intel Corporation, Santa Clara, CA, USA;Intel Corporation, Santa Clara, CA, USA;Intel Corporation, Santa Clara, CA, USA
Venue:
Scientific Programming - Selected Papers from Super Computing 2012
Year:
2013

Abstract

In high-performance computing on distributed-memory systems, communication often accounts for a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute density grows along current technology and industry trends. Designing lower-communication alternatives to fundamental computational algorithms has therefore become an important field of research. For distributed 1-D FFT, communication cost has so far remained high because all industry-standard implementations perform three all-to-all internode data exchanges, also called global transposes, and these communication steps dominate execution time. In this paper, we present a mathematical framework from which many easy-to-implement 1-D FFT algorithms that need only a single all-to-all exchange can be derived. For large-scale problems, our implementation can be twice as fast as leading FFT libraries on state-of-the-art computer clusters. Moreover, our framework allows a tradeoff between accuracy and performance, further boosting performance if reduced accuracy is acceptable.
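
To make the "three global transposes" concrete, the following is a minimal single-process sketch of the conventional transpose-based (often called six-step) decomposition of a length N = n1*n2 DFT. It is not the framework proposed in the paper, whose details the abstract does not give. The names `dft`, `transpose`, and `six_step_dft` are hypothetical, a naive O(n^2) DFT stands in for each node-local FFT, and each in-memory transpose marks where a distributed implementation would perform an all-to-all exchange.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT; stands in for a node-local FFT (assumption)."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Matrix transpose; in a distributed run this would be an all-to-all."""
    return [list(col) for col in zip(*m)]


def six_step_dft(x, n1, n2):
    """Length n1*n2 DFT via short DFTs, twiddle factors, and three transposes."""
    n = n1 * n2
    assert len(x) == n
    # View the input as an n1-by-n2 matrix: a[j1][j2] = x[n2*j1 + j2].
    a = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    at = transpose(a)                  # transpose 1: at[j2][j1]
    bt = [dft(row) for row in at]      # n2 DFTs of length n1: bt[j2][k1]
    # Twiddle step: multiply by exp(-2*pi*i * j2*k1 / n).
    ct = [
        [bt[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
        for j2 in range(n2)
    ]
    c = transpose(ct)                  # transpose 2: c[k1][j2]
    d = [dft(row) for row in c]        # n1 DFTs of length n2: d[k1][k2]
    dt = transpose(d)                  # transpose 3: dt[k2][k1]
    # Flattening dt row by row gives X[k1 + n1*k2] in natural order.
    return [value for row in dt for value in row]
```

Calling `six_step_dft(x, 4, 6)` on a 24-point input should agree with `dft(x)` up to rounding error. All three transposes move the entire data set, which is why they dominate when the matrix rows live on different nodes.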