42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence

Authors:
Tsuyoshi Hamada;Tetsu Narumi;Rio Yokota;Kenji Yasuoka;Keigo Nitadori;Makoto Taiji
Affiliations:
Nagasaki University, Nagasaki, Japan;University of Electro-Communications, Tokyo, Japan;University of Bristol, Bristol, United Kingdom;Keio University, Yokohama, Japan;RIKEN Advanced Science Institute, Wako, Japan;RIKEN Advanced Science Institute, Wako, Japan
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Citing 16
Cited 19

A fast algorithm for particle simulations
Journal of Computational Physics

A modified tree code: don't laugh; it runs
Journal of Computational Physics

Astrophysical N-body simulations using hierarchical tree data structures
Proceedings of the 1992 ACM/IEEE conference on Supercomputing

Astrophysical N-body simulations on GRAPE-4 special-purpose computer
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing

$7.0/Mflops astrophysical N-body simulation with treecode on GRAPE-5
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing

N-body simulation of galaxy formation on GRAPE-4 special-purpose computer
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing

A 1.349 Tflops simulation of black holes in a galactic center on GRAPE-6
Proceedings of the 2000 ACM/IEEE conference on Supercomputing

Avalon: an Alpha/Linux cluster achieves 10 Gflops for $15k
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing

A 29.5 Tflops simulation of planetesimals in Uranus-Neptune region on GRAPE-6
Proceedings of the 2002 ACM/IEEE conference on Supercomputing

Performance evaluation and tuning of GRAPE-6 - towards 40 "real" Tflops
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

$158/GFLOPS astrophysical N-body simulation with reconfigurable add-in card and hierarchical tree algorithm
Proceedings of the 2006 ACM/IEEE conference on Supercomputing

A 55 TFLOPS simulation of amyloid-forming peptides from yeast prion Sup35 with the special-purpose computer system MDGRAPE-3
Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Scan primitives for GPU computing
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware

Calculation of isotropic turbulence using a pure Lagrangian vortex method
Journal of Computational Physics

Fast multipole methods on graphics processors
Journal of Computational Physics

Speeding up N-body Calculations on Machines without Hardware Square Root
Scientific Programming

190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Diagnosis, Tuning, and Redesign for Multicore Performance: A Case Study of the Fast Multipole Method
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Scaling Hierarchical N-body Simulations on GPU Clusters
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Fast multipole method on GPU: tackling 3-D capacitance extraction on massively parallel SIMD platforms
Proceedings of the 48th Design Automation Conference

Enhancing locality for recursive traversals of recursive structures
Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Scalable fast multipole methods on distributed heterogeneous architectures
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Implementation of a hierarchical N-body simulator using the Ompss programming model
Proceedings of the first workshop on Irregular applications: architectures and algorithm

A massively parallel adaptive fast multipole method on heterogeneous architectures
Communications of the ACM

Scaling fast multipole methods up to 4000 GPUs
Proceedings of the ATIP/A*CRC Workshop on Accelerator Technologies for High-Performance Computing: Does Asia Lead the Way?

4.45 Pflops astrophysical N-body simulation on K computer: the gravitational trillion-body problem
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems
International Journal of High Performance Computing Applications

A peta-scalable CPU-GPU algorithm for global atmospheric simulations
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming

Parallel time-space processing model based fast N-body simulation on GPUs
Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores

Computational physics on graphics processing units
PARA'12 Proceedings of the 11th international conference on Applied Parallel and Scientific Computing

On the efficacy of GPU-integrated MPI for scientific applications
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing

An (almost) direct deployment of the Fast Multipole Method on the Cell processor
The Journal of Supercomputing

A CPU: GPU Hybrid Implementation and Model-Driven Scheduling of the Fast Multipole Method
Proceedings of Workshop on General Purpose Processing Using GPUs

Quantified Score

Hi-index	0.02

Abstract

As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs, which scale as O(N^2), the present method runs the O(N log N) treecode and the O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method with one standard application, a gravitational N-body simulation, and one non-standard application, a simulation of turbulence using vortex particles. The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops. The vortex particle simulation of homogeneous isotropic turbulence using the periodic FMM with 16,777,216 particles showed a sustained performance of 20.2 TFlops. The overall cost of the hardware was 228,912 dollars. The maximum corrected performance is 28.1 TFlops for the gravitational simulation, which results in a cost performance of 124 MFlops/$. This correction is performed by counting the Flops based on the most efficient CPU algorithm. Any extra Flops that arise from the GPU implementation and parameter differences are not included in the 124 MFlops/$.
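
For readers unfamiliar with the O(N^2) baseline that the abstract contrasts against, the following is a minimal sketch of direct pairwise summation for gravitational accelerations. It is not the authors' GPU code, and it does not implement the treecode or the FMM. The function name `direct_sum_accelerations`, the Plummer-style `softening` parameter, and the unit choice G = 1 are assumptions made purely for illustration.

```python
import math


def direct_sum_accelerations(positions, masses, softening=1e-3, G=1.0):
    """Illustrative O(N^2) direct summation (hypothetical helper, not from the paper).

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening and G are assumed parameters for this sketch.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = softening * softening
    for i in range(n):
        xi, yi, zi = positions[i]
        # Every particle interacts with every other one: N * (N - 1) pair evaluations.
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            s = G * masses[j] * inv_r3
            acc[i][0] += s * dx
            acc[i][1] += s * dy
            acc[i][2] += s * dz
    return acc
```

The double loop is what makes the cost grow quadratically with the particle count. Hierarchical methods such as the treecode and the FMM reduce this by approximating the contribution of distant groups of particles instead of visiting every pair.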