As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs, which scale as O(N^2), the present method runs the O(N log N) treecode and the O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method with one standard application (a gravitational N-body simulation) and one non-standard application (a simulation of turbulence using vortex particles). The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops. The vortex particle simulation of homogeneous isotropic turbulence using the periodic FMM with 16,777,216 particles showed a sustained performance of 20.2 TFlops. The overall cost of the hardware was 228,912 dollars. The maximum corrected performance is 28.1 TFlops for the gravitational simulation, which results in a cost performance of 124 MFlops/$. This correction is performed by counting the Flops based on the most efficient CPU algorithm. Any extra Flops that arise from the GPU implementation and parameter differences are not included in the 124 MFlops/$.
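The cost-performance figure is sustained Flops divided by hardware cost. The following minimal sketch reproduces that arithmetic from the rounded numbers quoted in the abstract; the function name and its parameters are illustrative assumptions, not part of the original work.

```python
def cost_performance_mflops_per_dollar(sustained_tflops: float,
                                       hardware_cost_usd: float) -> float:
    """Return cost performance in MFlops per dollar (hypothetical helper)."""
    flops = sustained_tflops * 1e12          # TFlops -> Flops
    return flops / hardware_cost_usd / 1e6   # Flops/$ -> MFlops/$


HARDWARE_COST_USD = 228_912  # overall hardware cost quoted in the abstract

# Corrected figure: Flops counted as for the most efficient CPU algorithm.
corrected = cost_performance_mflops_per_dollar(28.1, HARDWARE_COST_USD)

# Uncorrected figure: the sustained treecode performance actually measured.
uncorrected = cost_performance_mflops_per_dollar(42.15, HARDWARE_COST_USD)

print(f"corrected:   {corrected:.1f} MFlops/$")
print(f"uncorrected: {uncorrected:.1f} MFlops/$")
```

With these rounded inputs the corrected value comes out at roughly 123 MFlops/$, close to the stated 124 MFlops/$; the small gap presumably reflects rounding in the quoted performance or cost.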