A massively parallel adaptive fast-multipole method on heterogeneous architectures

Authors:
Ilya Lashuk;Aparna Chandramowlishwaran;Harper Langston;Tuan-Anh Nguyen;Rahul Sampath;Aashay Shringarpure;Richard Vuduc;Lexing Ying;Denis Zorin;George Biros
Affiliations:
Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;Georgia Institute of Technology, Atlanta, GA;University of Texas at Austin, Austin, TX;New York University, New York, NY;Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Citing 15
Cited 14

A fast algorithm for particle simulations

Journal of Computational Physics
An introduction to parallel algorithms

An introduction to parallel algorithms
A parallel hashed Oct-Tree N-body algorithm

Proceedings of the 1993 ACM/IEEE conference on Supercomputing
Provably Good Partitioning and Load Balancing Algorithms for Parallel Adaptive N-Body Simulation

SIAM Journal on Scientific Computing
A unifying data structure for hierarchical methods

SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
A fast adaptive multipole algorithm in three dimensions

Journal of Computational Physics
A kernel-independent adaptive fast multipole algorithm in two and three dimensions

Journal of Computational Physics
A New Parallel Kernel-Independent Fast Multipole Method

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Efficient parallel algorithms and software for compressed octrees with applications to hierarchical methods

Parallel Computing
Massively parallel implementation of a fast multipole method for distributed memory machines

Journal of Parallel and Distributed Computing
Fast, parallel, GPU-based construction of space filling curves and octrees

Proceedings of the 2008 symposium on Interactive 3D graphics and games
Fast multipole methods on graphics processors

Journal of Computational Physics
Bottom-Up Construction and 2:1 Balance Refinement of Linear Octrees in Parallel

SIAM Journal on Scientific Computing
Adapting a message-driven parallel application to GPU-accelerated clusters

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Dendro: parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees

Proceedings of the 2008 ACM/IEEE conference on Supercomputing

On the limits of GPU acceleration

HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Scaling Hierarchical N-body Simulations on GPU Clusters

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Scalable fast multipole methods on distributed heterogeneous architectures

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A massively parallel adaptive fast multipole method on heterogeneous architectures

Communications of the ACM
Reusing non-functional concerns across languages

Proceedings of the 11th annual international conference on Aspect-oriented Software Development
A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach

Computers in Biology and Medicine
Brief announcement: towards a communication optimal fast multipole method and its implications at exascale

Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Scaling fast multipole methods up to 4000 GPUs

Proceedings of the ATIP/A*CRC Workshop on Accelerator Technologies for High-Performance Computing: Does Asia Lead the Way?
A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems

International Journal of High Performance Computing Applications
Parallelization of the FMM on distributed-memory GPGPU systems for acoustic-scattering prediction

The Journal of Supercomputing
An (almost) direct deployment of the Fast Multipole Method on the Cell processor

The Journal of Supercomputing
A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

Proceedings of Workshop on General Purpose Processing Using GPUs

Quantified Score

Hi-index	0.02

Abstract

We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over comparable CPU-only implementations. We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. Doing so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
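
For orientation, the two-body potential evaluation that a fast multipole method accelerates can be written as a direct double loop. The sketch below is not the paper's implementation: it is a plain O(N*M) reference sum, and it assumes the Laplace kernel 1/(4*pi*r) as one example of a non-oscillatory kernel. The names `laplace_kernel` and `direct_potentials` are hypothetical and chosen only for illustration.

```python
import math


def laplace_kernel(r):
    """Laplace kernel 1/(4*pi*r); an assumed example of a non-oscillatory kernel."""
    return 1.0 / (4.0 * math.pi * r)


def direct_potentials(targets, sources, densities, kernel=laplace_kernel):
    """Reference O(N*M) evaluation of u(x_i) = sum_j K(|x_i - y_j|) * q_j.

    This is the brute-force sum that an FMM approximates; it is not an FMM.
    """
    potentials = []
    for x in targets:
        total = 0.0
        for y, q in zip(sources, densities):
            r = math.dist(x, y)
            if r > 0.0:  # skip the singular term when a target coincides with a source
                total += kernel(r) * q
        potentials.append(total)
    return potentials


# Two sources with densities 1 and 2, evaluated at two target points.
sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
densities = [1.0, 2.0]
targets = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
u = direct_potentials(targets, sources, densities)
```

The cost of this loop grows with the product of the number of targets and sources, which is why hierarchical methods such as the FMM are used at the problem sizes quoted in the abstract.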