Fast multipole methods on graphics processors

Authors:
Nail A. Gumerov;Ramani Duraiswami
Affiliations:
Perceptual Interfaces and Reality Laboratory, Computer Science and UMIACS, University of Maryland, College Park, United States and Fantalgo, LLC, Elkridge, MD, United States;Perceptual Interfaces and Reality Laboratory, Computer Science and UMIACS, University of Maryland, College Park, United States and Fantalgo, LLC, Elkridge, MD, United States
Venue:
Journal of Computational Physics
Year:
2008

Abstract

The fast multipole method (FMM) allows the rapid approximate evaluation of sums of radial basis functions. For a specified accuracy, ε, the method scales as O(N) in both time and memory, compared to the direct method with complexity O(N^2), which allows the solution of larger problems with given resources. Graphics processing units (GPUs) are increasingly viewed as data-parallel compute coprocessors that can provide significant computational performance at low price. We describe acceleration of the FMM using the data-parallel GPU architecture. The FMM has a complex hierarchical (adaptive) structure, which is not easily implemented on data-parallel processors. We describe strategies for parallelization of all components of the FMM, develop a model to explain the performance of the algorithm on the GPU architecture, and determine optimal settings for the FMM on the GPU. These optimal settings differ from those on conventional CPUs. Some innovations in the FMM algorithm, including the use of modified stencils, real polynomial basis functions for the Laplace kernel, and decompositions of the translation operators, are also described. We obtain speedups of 30 to 60 times for the Laplace kernel FMM on a single NVIDIA GeForce 8800 GTX GPU relative to a serial CPU FMM implementation. For a problem with a million sources, the summations involved are performed in approximately one second. This performance is equivalent to solving the same problem at a 43 Teraflop rate using straightforward summation.
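
As a rough illustration of the direct method that the abstract uses as its baseline, the following Python sketch evaluates the Laplace-kernel sum phi(y_j) = sum_i q_i / |y_j - x_i| by brute force, which costs O(N^2) operations for N sources and N targets. The function names, the choice to skip coincident points, and the helper that back-computes a per-pair operation count from the quoted figures are assumptions made for this example. None of this is taken from the paper, and it is not the authors' GPU implementation.

```python
import math


def direct_potentials(sources, charges, targets):
    """Brute-force Laplace-kernel sum: phi(y) = sum_i q_i / |y - x_i|.

    Cost is O(len(sources) * len(targets)); this is the "straightforward
    summation" that the FMM approximates in O(N) for a fixed accuracy.
    """
    potentials = []
    for y in targets:
        total = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(x, y)
            if r > 0.0:  # assumption: skip a source that coincides with the target
                total += q / r
        potentials.append(total)
    return potentials


def implied_flops_per_pair(n, seconds, flop_rate):
    """Hypothetical helper: operations per pairwise interaction implied by
    running an n-by-n direct sum in `seconds` at `flop_rate` flop/s."""
    return flop_rate * seconds / (n * n)


if __name__ == "__main__":
    sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    charges = [1.0, 2.0]
    targets = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    print(direct_potentials(sources, charges, targets))
    print(implied_flops_per_pair(10**6, 1.0, 43e12))
```

Plugging the abstract's figures into the helper (a million sources, about one second, a 43 Teraflop equivalent rate) gives roughly 43 floating-point operations per pairwise interaction. That per-pair count is a back-of-envelope inference from the quoted numbers, not a figure stated in the abstract.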