A CPU-GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method

Authors:
Jee Choi;Aparna Chandramowlishwaran;Kamesh Madduri;Richard Vuduc
Affiliations:
Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA;CSAIL, Massachusetts Institute of Technology, Cambridge, MA;Computer Science and Engineering, Pennsylvania State University, University Park, PA;Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA
Venue:
Proceedings of Workshop on General Purpose Processing Using GPUs
Year:
2014

Abstract

This paper presents an optimized CPU-GPU hybrid implementation of the kernel-independent fast multipole method (FMM), together with a GPU performance model. We implement an optimized kernel-independent FMM for GPUs and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. Compared with another highly optimized GPU implementation, ours achieves up to a 1.9× speedup. We then extend our previous lower-bound analyses of FMM on CPUs to include GPUs, which yields a model for predicting the execution times of the different phases of FMM. Using these predictions, we estimate the execution times of a set of static hybrid schedules on a given system and automatically choose the schedule predicted to perform best. In the best case, we achieve a 1.5× speedup over our GPU-only implementation, despite the large difference in computational power between CPUs and GPUs. Finally, we comment on one consequence of having such performance models: they enable speculative predictions about FMM scalability on future systems.
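The model-driven selection described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The phase names, the predicted times, and the cost rule are all assumptions made for illustration. In particular, the sketch assumes that phases assigned to the CPU and to the GPU run concurrently, so a schedule costs the larger of the two per-device totals, and it ignores phase dependencies and data-transfer overhead, which a real model would need to account for.

```python
from itertools import product


def estimate_makespan(assignment, cpu_time, gpu_time):
    """Predicted time of one static schedule.

    Assumption: the CPU and GPU work concurrently on their assigned phases,
    so the schedule takes as long as the busier device.
    """
    cpu_total = sum(cpu_time[p] for p, dev in assignment.items() if dev == "cpu")
    gpu_total = sum(gpu_time[p] for p, dev in assignment.items() if dev == "gpu")
    return max(cpu_total, gpu_total)


def best_static_schedule(cpu_time, gpu_time):
    """Enumerate every phase-to-device assignment and keep the cheapest one."""
    phases = sorted(cpu_time)
    best = None
    for devices in product(("cpu", "gpu"), repeat=len(phases)):
        assignment = dict(zip(phases, devices))
        predicted = estimate_makespan(assignment, cpu_time, gpu_time)
        if best is None or predicted < best[1]:
            best = (assignment, predicted)
    return best


# Hypothetical per-phase predictions in seconds (not taken from the paper).
cpu_time = {"up": 0.2, "u_list": 4.0, "v_list": 1.0, "down": 0.3}
gpu_time = {"up": 0.5, "u_list": 1.0, "v_list": 1.2, "down": 0.6}

schedule, predicted = best_static_schedule(cpu_time, gpu_time)
gpu_only = sum(gpu_time.values())

print("chosen schedule:", schedule)
print("predicted hybrid time:", predicted)
print("predicted GPU-only time:", gpu_only)
```

Exhaustive enumeration is practical here only because the number of phases is small. The useful property is that no schedule has to be executed to be ranked: only the per-phase predictions are needed.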