This paper presents an optimized CPU-GPU hybrid implementation of the kernel-independent fast multipole method (FMM), together with a GPU performance model. We implement an optimized kernel-independent FMM for GPUs and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. Compared with another highly optimized GPU implementation, ours achieves a speedup of up to 1.9×. We then extend our previous lower-bound analyses of FMM on CPUs to include GPUs, which yields a model for predicting the execution times of the different phases of FMM. Using these predictions, we estimate the execution times of a set of static hybrid schedules on a given system and automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× over our GPU-only implementation, despite the large difference in computational power between CPUs and GPUs. One consequence of having such performance models is that they enable speculative predictions about FMM scalability on future systems.
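The model-driven choice among static hybrid schedules can be illustrated with a minimal sketch. It is not the paper's model or scheduler. The phase names, the predicted times, and the cost rule are all assumptions made for illustration: each phase is assigned wholly to one device, the CPU and GPU run concurrently, and dependencies between phases and data-transfer costs are ignored, so the estimated time of a schedule is the larger of the two per-device totals.

```python
from itertools import product

# Hypothetical per-phase time predictions in seconds. The phase names and
# numbers are illustrative placeholders, not measurements from the paper.
PREDICTED = {
    "upward":   {"cpu": 0.5, "gpu": 0.25},
    "u_list":   {"cpu": 4.0, "gpu": 0.5},
    "v_list":   {"cpu": 1.0, "gpu": 0.75},
    "downward": {"cpu": 0.5, "gpu": 0.25},
}


def schedule_time(assignment, predicted):
    """Estimate a schedule's time as the busier device's total.

    Assumes the CPU and GPU run concurrently and that phases have no
    ordering constraints (a simplification of a real FMM).
    """
    totals = {"cpu": 0.0, "gpu": 0.0}
    for phase, device in assignment.items():
        totals[device] += predicted[phase][device]
    return max(totals.values())


def best_schedule(predicted):
    """Enumerate every static phase-to-device assignment; return the fastest."""
    phases = list(predicted)
    best = None
    for devices in product(("cpu", "gpu"), repeat=len(phases)):
        assignment = dict(zip(phases, devices))
        t = schedule_time(assignment, predicted)
        if best is None or t < best[1]:
            best = (assignment, t)
    return best


if __name__ == "__main__":
    gpu_only = schedule_time({p: "gpu" for p in PREDICTED}, PREDICTED)
    assignment, t = best_schedule(PREDICTED)
    print("GPU-only estimate:", gpu_only)
    print("Best hybrid schedule:", assignment, "estimate:", t)
```

With these made-up numbers, the search moves one phase to the CPU even though the CPU is slower on every phase, because the two devices then finish at about the same time. Exhaustive enumeration is practical here only because the number of phases is small.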