Exploring parallelization strategies for NUFFT data translation

Authors:
Yuanrui Zhang;Mahmut Kandemir;Nikos P. Pitsianis;Xiaobai Sun
Affiliations:
The Pennsylvania State University, State College, PA, USA;The Pennsylvania State University, State College, PA, USA;Aristotle University, Thessaloniki, Greece;Duke University, Durham, NC, USA
Venue:
EMSOFT '09 Proceedings of the seventh ACM international conference on Embedded software
Year:
2009



Abstract

This paper introduces parallelization strategies for Non-Uniform FFT (NUFFT) data translation on multicore architectures. The NUFFT enables the use of the celebrated FFT with unequally spaced data in numerous situations in signal and image processing as well as in scientific computing. The critical extension lies in the translation of unequally spaced, or non-uniformly sampled, data onto an equally spaced Cartesian grid, or vice versa. The data translation can be made sufficiently accurate, with arithmetic complexity linearly proportional to the size of the data ensemble. For large NUFFTs, however, the data translation is found to dominate computation time on modern computers, even though the FFT is expected to be the dominant cost. In order to match the FFT performance achieved by FFTW, data locality and parallelism in the data translation must be explored and exploited as well. We are concerned with two fundamental issues. First, the data translation can be described as a matrix-vector multiplication with a matrix of irregular sparsity. This is beyond the effective scope of the conventional tiling and parallelization schemes that a compiler applies to improve the performance of computations with dense matrices. Second, multicore processors exist and continue to emerge in many different configurations, and are expected to evolve further in architectural variety. This may mean the end of performance tuning for a single type of architecture. In this paper, we introduce an automation tool that takes two specifications as input: one for an application-specific data translation algorithm, the other for a target multicore processor architecture. The tool generates parallel code that exploits data locality and parallelism by utilizing both the geometric structures in the data translation and the processor-memory configurations of the target architecture. We present preliminary experimental results on both a simulator and a commercial multicore machine. The results show that our parallelization strategy brings significant performance improvement for the NUFFT data translation by efficiently exploiting the data locality and concurrency in the application.
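
To make the data translation step concrete, the following is a minimal serial sketch in Python of spreading non-uniform samples onto a periodic, equally spaced 1D grid. It is an illustration only and not the method or tool described in the paper: the function name `translate_to_grid`, the Gaussian window, and the parameters `half_width` and `sigma` are all assumptions chosen for brevity. Each sample contributes to a fixed number of neighboring grid points, which corresponds to one sparse column of the translation matrix and gives work linear in the number of samples.

```python
import math


def translate_to_grid(sample_pos, sample_val, grid_size, half_width=2, sigma=0.8):
    """Spread non-uniform samples onto a uniform, periodic 1D grid.

    sample_pos : sample positions in grid units, each in [0, grid_size)
    sample_val : one value per sample
    half_width : each sample touches 2 * half_width + 1 grid points
    sigma      : width of the (assumed) Gaussian window, in grid units
    """
    grid = [0.0] * grid_size
    for x, v in zip(sample_pos, sample_val):
        center = int(round(x))  # nearest grid point to the sample
        for k in range(center - half_width, center + half_width + 1):
            # Window weight: one nonzero entry of the sparse translation matrix.
            w = math.exp(-((x - k) ** 2) / (2.0 * sigma ** 2))
            # Scatter update; the modulo wraps indices around the periodic grid.
            grid[k % grid_size] += w * v
    return grid


if __name__ == "__main__":
    # Three irregularly placed samples on an 8-point grid.
    print(translate_to_grid([0.3, 3.0, 6.75], [1.0, 2.0, -1.0], 8))
```

In this sketch, which grid points are updated depends on the sample positions, so the access pattern is irregular, and two samples that fall close together write to the same grid entries. A parallel version would therefore need to partition samples or grid regions so that concurrent updates do not conflict, which is the kind of locality and concurrency concern the abstract raises.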