A massively parallel adaptive fast multipole method on heterogeneous architectures

Authors:
Ilya Lashuk;Aparna Chandramowlishwaran;Harper Langston;Tuan-Anh Nguyen;Rahul Sampath;Aashay Shringarpure;Richard Vuduc;Lexing Ying;Denis Zorin;George Biros
Affiliations:
Lawrence Livermore National Laboratory, Livermore, CA;College of Computing, Atlanta, GA;College of Computing, Atlanta, GA;College of Computing, Atlanta, GA;Oak Ridge National Laboratory, Oak Ridge, TN;-;College of Computing, Atlanta, GA;University of Texas at Austin, TX;New York University, New York, NY;The University of Texas at Austin, TX
Venue:
Communications of the ACM
Year:
2012

Citing 19
Cited 2

A fast algorithm for particle simulations

Journal of Computational Physics
An introduction to parallel algorithms

An introduction to parallel algorithms
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Separators for sphere-packings and nearest neighbor graphs

Journal of the ACM (JACM)
Provably Good Partitioning and Load Balancing Algorithms for Parallel Adaptive N-Body Simulation

SIAM Journal on Scientific Computing
Algorithms for dynamic closest pair and n-body potential fields

Proceedings of the sixth annual ACM-SIAM symposium on Discrete algorithms
Fast Evaluation of Radial Basis Functions: Methods for Generalized Multiquadrics in $\mathbb{R}^n$

SIAM Journal on Scientific Computing
Scalable parallel formulations of the Barnes-Hut method for n-body simulations

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
A Provably Optimal, Distribution-Independent Parallel Fast Multipole Method

IPDPS '00 Proceedings of the 14th International Symposium on Parallel and Distributed Processing
A kernel-independent adaptive fast multipole algorithm in two and three dimensions

Journal of Computational Physics
A New Parallel Kernel-Independent Fast Multipole Method

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Massively parallel implementation of a fast multipole method for distributed memory machines

Journal of Parallel and Distributed Computing
Fast multipole methods on graphics processors

Journal of Computational Physics
Bottom-Up Construction and 2:1 Balance Refinement of Linear Octrees in Parallel

SIAM Journal on Scientific Computing
Adapting a message-driven parallel application to GPU-accelerated clusters

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
A massively parallel adaptive fast-multipole method on heterogeneous architectures

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Efficient parallel algorithms and software for compressed octrees with applications to hierarchical methods

Parallel Computing
Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

HykSort: a new variant of hypercube quicksort on distributed memory architectures

Proceedings of the 27th International ACM Conference on Supercomputing
Algorithms for high-throughput disk-to-disk sorting

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Abstract

We describe a parallel fast multipole method (FMM) for highly nonuniform distributions of particles. We employ both distributed-memory parallelism (via MPI) and shared-memory parallelism (via OpenMP and GPU acceleration) to rapidly evaluate two-body nonoscillatory potentials in three dimensions on heterogeneous high-performance computing architectures. We have performed scalability tests with up to 30 billion particles on 196,608 cores on the AMD/Cray-based Jaguar system at ORNL. On a GPU-enabled system (NSF's Keeneland at Georgia Tech/ORNL), we observed a 30× speedup over a single-core CPU implementation and a 7× speedup over a multicore CPU implementation. By combining GPUs with MPI, we achieve less than 10 ns/particle and six digits of accuracy for a run with 48 million nonuniformly distributed particles on 192 GPUs.
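
For orientation, the quantity an FMM approximates can be written as a direct pairwise sum. The sketch below is not the authors' method: it is the O(N^2) reference evaluation that an FMM is designed to avoid, written for the 3D Laplace kernel 1/(4πr), which is assumed here as one example of a two-body nonoscillatory potential. The function name `direct_potentials` and its arguments are hypothetical, chosen only for illustration.

```python
import math


def direct_potentials(points, charges):
    """Reference O(N^2) sum: u_i = sum over j != i of q_j / (4*pi*|x_i - x_j|).

    Assumption: the 3D Laplace kernel stands in for the "two-body
    nonoscillatory potential" of the abstract. This is a baseline for
    checking accuracy, not a fast multipole method.
    """
    n = len(points)
    potentials = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            r = math.dist(points[i], points[j])  # Euclidean distance
            total += charges[j] / (4.0 * math.pi * r)
        potentials[i] = total
    return potentials


if __name__ == "__main__":
    # Two unit charges one unit apart: each sees the potential of the other.
    pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(direct_potentials(pts, [1.0, 1.0]))
```

A direct sum like this is typically used on a small sample of targets to measure the error of a fast method, since its cost grows quadratically with the number of particles.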