We present the results of a hierarchical N-body simulation on DEGIMA, a cluster of PCs with 576 graphics processing units (GPUs) connected by an InfiniBand interconnect. DEGIMA (DEstination for GPU Intensive MAchine) is located at the Nagasaki Advanced Computing Center (NACC), Nagasaki University, and is composed of 144 nodes hosting 576 GT200 GPUs; in this work we upgraded its interconnect to InfiniBand. An astrophysical N-body simulation with 3,278,982,596 particles using a treecode algorithm sustains 190.5 Tflops on DEGIMA. The overall hardware cost was $411,921. The maximum corrected performance is 104.8 Tflops for the simulation, yielding a cost performance of 254.4 MFlops/$. This correction counts only the FLOPS required by the most efficient CPU algorithm; any extra FLOPS arising from the GPU implementation and parameter differences are excluded from the 254.4 MFlops/$ figure.
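
As a quick sanity check, the quoted cost-performance figure follows directly from the corrected performance and the hardware cost given above. The short Python sketch below reproduces the arithmetic; the numbers are taken from the abstract, while the function name is illustrative only and not from the original work.

# Minimal sketch reproducing the cost-performance arithmetic quoted above.
# Figures are from the abstract; the helper name is our own invention.
def cost_performance_mflops_per_dollar(sustained_flops, hardware_cost_usd):
    """Return cost performance in MFlops per dollar."""
    return sustained_flops / hardware_cost_usd / 1.0e6

corrected_flops = 104.8e12  # corrected sustained performance: 104.8 Tflops
hardware_cost = 411_921     # total hardware cost in US dollars

print(cost_performance_mflops_per_dollar(corrected_flops, hardware_cost))
# prints ~254.4, matching the 254.4 MFlops/$ quoted in the abstract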