Parallel sparse FFT

Authors:
Cheng Wang;Mauricio Araya-Polo;Sunita Chandrasekaran;Amik St-Cyr;Barbara Chapman;Detlef Hohl
Affiliations:
University of Houston, Houston, TX;Shell International E&P Inc., Houston, TX;University of Houston, Houston, TX;Shell International E&P Inc., Houston, TX;University of Houston, Houston, TX;Shell International E&P Inc., Houston, TX
Venue:
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
Year:
2013

Abstract

The Fast Fourier Transform (FFT) is a widely used numerical algorithm. When N input data points yield only k ≪ N non-zero coefficients in the transformed domain, the algorithm is clearly inefficient: the FFT performs O(N log N) operations on N input data points to calculate only k non-zero or large coefficients and N − k zero or negligibly small ones. The recently developed sparse FFT (sFFT) algorithm provides a solution to this problem. Like FFT algorithms, sFFT algorithms are complex and remain computationally challenging. The computational difficulties stem mainly from memory access patterns that are irregular and change dynamically. Modern compute platforms are exclusively based on multi-core processors, so a natural path to enhancing the sFFT's performance is to exploit parallelism. This is the approach chosen in this work. We have analyzed in detail and parallelized the most time-consuming segments of the algorithm. Our parallel sFFT (PsFFT) implementation achieves approximately 60% parallel efficiency on a single 8-core Intel Sandy Bridge socket for relevant test cases. In addition, we apply several techniques, such as index coalescing, data-affiliated loops, and multi-level blocking, to alleviate memory access congestion and increase performance.
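
The sparsity premise in the abstract can be illustrated with a small sketch. This is not the authors' PsFFT code and not a full sFFT. It assumes a hypothetical signal with N = 64 and k = 3 non-zero coefficients, and uses a naive O(N^2) DFT from the standard library as a stand-in for an FFT routine. The first part shows that a dense transform computes all N coefficients although only k are non-negligible. The second part shows the aliasing property that sFFT-style algorithms in the literature commonly exploit, presumably including the one parallelized here: subsampling in time by N/B folds coefficient f into bucket f mod B of a B-point transform, scaled by B/N. All names (dft, sparse_signal, coeffs, B) are illustrative.

```python
import cmath

def dft(x):
    """Naive O(N^2) forward DFT, used here only as a stand-in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def sparse_signal(n, coeffs):
    """Time-domain signal whose spectrum is exactly coeffs ({frequency: value})."""
    return [sum(v * cmath.exp(2j * cmath.pi * f * t / n)
                for f, v in coeffs.items()) / n
            for t in range(n)]

N, B = 64, 8                          # signal length and bucket count (assumed)
coeffs = {3: 5.0, 17: 2.0, 42: 1.0}   # k = 3 non-zero coefficients (hypothetical)
x = sparse_signal(N, coeffs)

# Dense transform: all N coefficients are computed, N - k of them are ~0.
X = dft(x)
large = [f for f in range(N) if abs(X[f]) > 1e-6]

# Bucketing by aliasing: keep every (N/B)-th sample, transform only B points.
y = x[::N // B]
Y = dft(y)

# Bucket values predicted by the aliasing identity (no collisions in this example).
buckets = {f % B: v * B / N for f, v in coeffs.items()}

print("non-negligible frequencies:", large)
print("occupied buckets:", sorted(j for j in range(B) if abs(Y[j]) > 1e-6))
```

Recovering the original frequency of each occupied bucket, and handling collisions when two coefficients share a bucket, is the part that requires the permutation, filtering, and voting steps of an actual sFFT; those steps are omitted here.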