This paper describes an ${\cal O}(p^2 \log_2(p) N)$ implementation of the fast multipole algorithm (FMA) for $N$-body simulations, where $p$ is the number of terms retained in the truncated multipole expansion that represents the potential field of a collection of charged particles. The parameter $p$ determines the accuracy of the calculation. The implementation is asymptotically faster than the original FMA, which is ${\cal O}(p^4 N)$. The limiting ${\cal O}(p^4)$ computation in the original FMA is a convolution-like operation on a matrix of multipole coefficients. This paper gives the implementation details of converting this limiting computation to a linear convolution, which is then computed in the Fourier domain using the fast Fourier transform (FFT), following a method originally outlined by Greengard and Rokhlin. In addition, the paper describes a new block decomposition of the multipole expansion data that provides numerical stability and efficient computation. The resulting ${\cal O}(p^2 \log_2(p))$ subroutine has a speedup of 2 over the original method on a sequential processor for $p=8$, and a speedup of 4 for $p=16$. The new subroutine also vectorizes well, with a speedup of 3 on a vector processor at $p=8$ and a speedup of 6 at $p=16$.
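The core idea of replacing a direct convolution with a product in the Fourier domain can be illustrated with a minimal sketch. This is a one-dimensional analogue only, not the paper's subroutine: the FMA operation acts on a two-dimensional matrix of multipole coefficients, and the block decomposition mentioned in the abstract is not modeled here. The function names `fft`, `direct_convolution`, and `fft_convolution` are hypothetical and chosen for illustration.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def direct_convolution(a, b):
    """Reference linear convolution; cost grows as len(a) * len(b)."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def fft_convolution(a, b):
    """Linear convolution via zero-padded FFTs and a pointwise product."""
    out_len = len(a) + len(b) - 1
    size = 1
    while size < out_len:  # pad to a power of two so no circular wrap-around occurs
        size *= 2
    fa = fft(list(a) + [0j] * (size - len(a)))
    fb = fft(list(b) + [0j] * (size - len(b)))
    product = [x * y for x, y in zip(fa, fb)]
    result = fft(product, inverse=True)
    # The inverse transform above is unnormalized, so divide by the length.
    return [v / size for v in result[:out_len]]


if __name__ == "__main__":
    a = [1, 2, 3]
    b = [4, 5, 6, 7]
    print([round(v.real) for v in fft_convolution(a, b)])
```

The zero-padding step is what turns the circular convolution computed by the FFT into a linear one. For inputs of length $n$, the direct loop costs ${\cal O}(n^2)$ while the FFT route costs ${\cal O}(n \log n)$; the analogous saving on a $p \times p$ coefficient matrix is what reduces ${\cal O}(p^4)$ to ${\cal O}(p^2 \log_2(p))$.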