We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a heterogeneous CPU-GPU computing node with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Observing that the local summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. We analyze the FMM carefully to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version in which the CPU part is parallelized using OpenMP and the GPU part using CUDA. We present new parallel algorithms for creating FMM data structures, together with load-balancing strategies for the single-node and distributed multi-node versions. Our implementation performs the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.
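The independence of the two parts mentioned above can be illustrated with a small Python sketch. It is not the implementation described in the abstract. It assumes a 1D domain [0, 1) split into equal boxes, a 1/|x - y| kernel, and a far-field part that is summed directly as a stand-in for the multipole translations a real FMM would use. The names `near_field`, `far_field`, `split_sum`, and `direct_sum` are hypothetical, and the two worker threads only stand in for the GPU and CPU roles.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def box_index(x, n_boxes):
    """Index of the box containing x, for x in [0, 1)."""
    return min(int(x * n_boxes), n_boxes - 1)


def kernel(x, y):
    """Illustrative 1/r kernel (an assumption, not taken from the source)."""
    return 1.0 / abs(x - y)


def _partial_sum(points, charges, n_boxes, want_near):
    """Sum contributions from sources that are near (same or adjacent box) or far."""
    boxes = [box_index(x, n_boxes) for x in points]
    out = [0.0] * len(points)
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i == j:
                continue
            is_near = abs(boxes[i] - boxes[j]) <= 1
            if is_near == want_near:
                out[i] += charges[j] * kernel(xi, xj)
    return out


def near_field(points, charges, n_boxes):
    """Local summation part (mapped to the GPUs in the abstract)."""
    return _partial_sum(points, charges, n_boxes, want_near=True)


def far_field(points, charges, n_boxes):
    """Far part; a real FMM would use expansions and translations here."""
    return _partial_sum(points, charges, n_boxes, want_near=False)


def split_sum(points, charges, n_boxes):
    """Run both parts concurrently, then add the two partial results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        near = pool.submit(near_field, points, charges, n_boxes)
        far = pool.submit(far_field, points, charges, n_boxes)
        return [a + b for a, b in zip(near.result(), far.result())]


def direct_sum(points, charges):
    """Reference O(N^2) sum over all pairs."""
    return [
        sum(q * kernel(xi, xj) for j, (xj, q) in enumerate(zip(points, charges)) if j != i)
        for i, xi in enumerate(points)
    ]


if __name__ == "__main__":
    pts = [(i + 0.5) / 32 for i in range(32)]
    q = [1.0] * 32
    total = split_sum(pts, q, n_boxes=4)
    ref = direct_sum(pts, q)
    print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(total, ref)))
```

Because neither part reads the other's output, the two can run on different devices and be combined with a single addition at the end. The practical difficulty, which the abstract addresses through its work-distribution analysis, is choosing the decomposition so that both sides finish at about the same time.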