We describe a parallel fast multipole method (FMM) for highly nonuniform distributions of particles. We employ both distributed-memory parallelism (via MPI) and shared-memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high-performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores on the AMD/CRAY-based Jaguar system at ORNL. On a GPU-enabled system (NSF's Keeneland at Georgia Tech/ORNL), we observed a 30× speedup over a single-core CPU implementation and a 7× speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
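For orientation, the quantity an FMM approximates can be written as a direct pairwise sum. The following is a minimal sketch of that O(N²) baseline, not the method described above. It assumes the free-space Laplace kernel 1/(4πr) as one example of a nonoscillatory two-body potential; the function name `direct_potential` and the sample points and charges are hypothetical and chosen only for illustration.

```python
import math


def direct_potential(points, charges):
    """Direct O(N^2) sum: u_i = sum over j != i of q_j / (4*pi*|x_i - x_j|).

    Assumption: the Laplace kernel stands in for a generic
    nonoscillatory two-body potential.
    """
    n = len(points)
    potentials = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            r = math.dist(points[i], points[j])  # Euclidean distance in 3D
            total += charges[j] / (4.0 * math.pi * r)
        potentials[i] = total
    return potentials


# Three hypothetical particles with hypothetical source strengths.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
charges = [1.0, 1.0, 2.0]
u = direct_potential(points, charges)
```

The cost of this double loop grows quadratically with the number of particles, which is why a direct sum is impractical at the particle counts reported above and a hierarchical method such as the FMM is used instead.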