Recently, an implicit, nonlinearly consistent, energy- and charge-conserving one-dimensional (1D) particle-in-cell (PIC) method has been proposed for multi-scale, full-f kinetic simulations [G. Chen et al., J. Comput. Phys. 230 (18) (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, capable of using very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent. This paper describes an efficient, mixed-precision hybrid CPU-GPU implementation of the 1D implicit PIC algorithm that exploits this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, and adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover algorithm is shown to achieve up to 400 GOp/s on an Nvidia GeForce GTX580. This corresponds to 25% absolute GPU efficiency against the peak theoretical performance, and is about 100 times faster than an equivalent compiler-optimized execution on a single CPU core (Intel Xeon X5460). For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform the DP CPU-only serial version by a factor of ~100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.
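The roofline bound and the efficiency figure quoted in the abstract can be illustrated with a minimal Python sketch. The peak throughput and memory bandwidth below are assumptions taken from publicly listed GTX580 specifications, not from the paper, and the function and constant names are hypothetical. Only the 400 GOp/s figure comes from the abstract.

```python
def roofline(peak_gops, bandwidth_gbs, intensity):
    """Attainable throughput in GOp/s for a kernel with the given
    arithmetic intensity (operations per byte of memory traffic)."""
    return min(peak_gops, bandwidth_gbs * intensity)


# Assumed hardware figures (public GTX580 specifications, not from the paper).
PEAK_SP_GOPS = 1581.0   # single-precision peak, GOp/s
BANDWIDTH_GBS = 192.4   # device memory bandwidth, GB/s

# Throughput reported in the abstract for the implicit particle mover.
MEASURED_GOPS = 400.0

# Intensity at which a kernel stops being memory-bound.
ridge_point = PEAK_SP_GOPS / BANDWIDTH_GBS

# Fraction of the assumed peak that the reported throughput represents.
efficiency = MEASURED_GOPS / PEAK_SP_GOPS

if __name__ == "__main__":
    for intensity in (0.5, 2.0, ridge_point, 32.0):
        bound = roofline(PEAK_SP_GOPS, BANDWIDTH_GBS, intensity)
        print(f"intensity {intensity:6.2f} op/byte -> bound {bound:7.1f} GOp/s")
    print(f"ridge point: {ridge_point:.2f} op/byte")
    print(f"efficiency at 400 GOp/s: {efficiency:.1%}")
```

Under these assumed numbers, 400 GOp/s is roughly a quarter of the single-precision peak, in line with the 25% efficiency stated above. A kernel would also need an arithmetic intensity above about 2 operations per byte before the memory-bandwidth roof allows that throughput.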