This paper presents a parallel algorithm, implemented on graphics processing units (GPUs), for rapidly evaluating spatial convolutions between the Helmholtz potential and a large-scale source distribution. The algorithm implements a non-uniform grid interpolation method (NGIM), which uses amplitude and phase compensation and spatial interpolation from a sparse grid to compute the field outside a source domain. The NGIM reduces the computational cost of the direct field evaluation at N observers due to N co-located sources from O(N^2) to O(N) in the static and low-frequency regimes, to O(N log N) in the high-frequency regime, and to intermediate costs in the mixed-frequency regime. Memory requirements scale as O(N) in all frequency regimes. Achieving optimal performance on the respective platforms requires several important differences between the CPU and GPU implementations of the NGIM. In particular, in the CPU implementations all operations that can be pre-computed are evaluated and stored in memory during a preprocessing stage; this reduces the computational time but significantly increases the memory consumption. In the GPU implementations, where memory handling is often a critical bottleneck, several special memory-handling techniques are used to accelerate the computations. The significant latency of GPU global memory access is hidden by coalesced reads, which require arranging the relevant array elements in contiguous parts of memory. In contrast to the CPU version, most of the steps in the GPU implementations are executed on the fly, and only the necessary arrays are kept in memory. This significantly reduces memory consumption, increases the problem size N that can be handled, and reduces the computational time on the GPUs. The obtained GPU-to-CPU speed-up ratios range from 150 to 400, depending on the required accuracy and problem size. The presented method and its CPU and GPU implementations can find important applications in various fields of physics and engineering.
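For context, the spatial convolution being accelerated is the free-space Helmholtz potential sum u(r_m) = sum_n q_n exp(jk|r_m - r_n|)/|r_m - r_n| (up to a constant factor), whose direct evaluation at N observers due to N sources costs O(N^2). The CUDA kernel below is a minimal sketch of that O(N^2) baseline, with one thread per observer; the kernel name, the single-precision float2 complex arithmetic, and the array layout are illustrative assumptions rather than the paper's actual implementation.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Direct O(N^2) evaluation of the Helmholtz potential
//   u(r_m) = sum_n q_n * exp(j*k*|r_m - r_n|) / |r_m - r_n|
// at nObs observers due to nSrc sources. This is the baseline that the
// NGIM replaces with amplitude/phase compensation and interpolation from
// a sparse non-uniform grid. One thread evaluates the field at one observer.
__global__ void helmholtz_direct(const float3* __restrict__ obs,   // observer positions
                                 const float3* __restrict__ src,   // source positions
                                 const float2* __restrict__ q,     // complex source amplitudes
                                 float2* __restrict__ field,       // complex output field
                                 int nObs, int nSrc, float k)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= nObs) return;

    float2 acc = make_float2(0.0f, 0.0f);
    float3 rm = obs[m];
    for (int n = 0; n < nSrc; ++n) {
        float dx = rm.x - src[n].x;
        float dy = rm.y - src[n].y;
        float dz = rm.z - src[n].z;
        float r  = sqrtf(dx * dx + dy * dy + dz * dz);
        if (r == 0.0f) continue;                 // skip the self term for co-located points
        float s, c;
        sincosf(k * r, &s, &c);                  // exp(j*k*r) = cos(kr) + j*sin(kr)
        float inv_r = 1.0f / r;
        float gr = c * inv_r, gi = s * inv_r;    // Green's function G = exp(j*k*r)/r
        acc.x += q[n].x * gr - q[n].y * gi;      // accumulate q_n * G (complex multiply)
        acc.y += q[n].x * gi + q[n].y * gr;
    }
    field[m] = acc;
}
```

The NGIM avoids this brute-force loop: in the standard NGIM construction, the field radiated by a source subdomain with center r_c is first evaluated on a sparse non-uniform grid, the rapidly oscillating factor exp(jk|r - r_c|)/|r - r_c| is divided out (the amplitude and phase compensation mentioned above), the remaining slowly varying function is interpolated to the observers, and the factor is then restored. This is what yields the O(N) to O(N log N) scaling quoted in the abstract.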
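The remark on coalesced reading can be illustrated with a short sketch: when consecutive threads of a warp read consecutive global-memory addresses, the hardware merges the accesses into a few wide transactions, which hides much of the memory latency; strided or scattered accesses do not coalesce and expose the full latency. The two kernels below are a generic illustration of this distinction and are not taken from the paper.

```cuda
#include <cuda_runtime.h>

// Coalesced access: thread i reads element i, so each warp touches one
// contiguous segment of global memory and its reads are merged into a
// small number of wide transactions.
__global__ void scale_coalesced(const float* __restrict__ in,
                                float* __restrict__ out,
                                int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Strided (non-coalesced) access: thread i reads element i*stride, so the
// addresses touched by a warp are scattered and each read may become a
// separate memory transaction, exposing the full global-memory latency.
__global__ void scale_strided(const float* __restrict__ in,
                              float* __restrict__ out,
                              int n, int stride, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = a * in[i * stride];
}
```

This is why the abstract notes that coalescing requires arranging many array elements in contiguous parts of memory: the grid samples, interpolation data, and source/observer arrays must be laid out so that neighbouring threads touch neighbouring addresses.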