An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algorithm

Authors:
G. Chen;L. Chacón;D. C. Barnes
Affiliations:
Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA;Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA;Coronado Consulting, Lamy, NM 87540, USA
Venue:
Journal of Computational Physics
Year:
2012

Abstract

An implicit, nonlinearly consistent, energy- and charge-conserving one-dimensional (1D) particle-in-cell (PIC) method was recently proposed for multi-scale, full-f kinetic simulations [G. Chen et al., J. Comput. Phys. 230 (18) (2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver, which can use very large timesteps without loss of numerical stability or accuracy. A fundamental feature of the method is that particle-orbit computations are segregated from the field solver while the scheme remains fully self-consistent. This paper describes an efficient, mixed-precision, hybrid CPU-GPU implementation of the 1D implicit PIC algorithm that exploits this feature. The JFNK solver is kept on the CPU in double precision (DP), while the implicit, charge-conserving, adaptive particle mover is implemented on a GPU (graphics processing unit) using CUDA in single precision (SP). Performance-oriented optimizations are introduced with the aid of the roofline model. The implicit particle mover is shown to achieve up to 400 GOp/s on an Nvidia GeForce GTX 580. This corresponds to 25% absolute GPU efficiency relative to the theoretical peak performance, and is about 100 times faster than an equivalent compiler-optimized execution on a single CPU core (Intel Xeon X5460). For the test case chosen, the mixed-precision hybrid CPU-GPU solver is shown to outperform the DP CPU-only serial version by a factor of about 100, without apparent loss of robustness or accuracy in a challenging long-timescale ion acoustic wave simulation.
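The efficiency figure quoted above can be checked with a few lines of arithmetic in the spirit of the roofline model, which bounds attainable throughput by the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity. The following is a minimal sketch, not the authors' analysis. The peak single-precision rate (about 1581 GOp/s) and memory bandwidth (about 192.4 GB/s) are assumptions taken from the published GeForce GTX 580 specifications rather than from the abstract, and the function names are hypothetical.

```python
# Roofline-style sanity check of the abstract's "400 GOp/s = 25% of peak" claim.
# ASSUMPTIONS (not stated in the abstract): published GTX 580 figures of
# ~1581 GOp/s single-precision peak and ~192.4 GB/s memory bandwidth.

PEAK_SP_GOPS = 1581.0      # assumed theoretical SP peak, GOp/s
BANDWIDTH_GBS = 192.4      # assumed peak memory bandwidth, GB/s
ACHIEVED_GOPS = 400.0      # particle-mover throughput quoted in the abstract


def roofline(peak_gops, bandwidth_gbs, intensity):
    """Attainable throughput (GOp/s) for a kernel with the given
    arithmetic intensity (operations per byte of memory traffic)."""
    return min(peak_gops, bandwidth_gbs * intensity)


def ridge_point(peak_gops, bandwidth_gbs):
    """Arithmetic intensity above which a kernel is compute-bound."""
    return peak_gops / bandwidth_gbs


# Fraction of the assumed peak that the quoted throughput represents.
efficiency = ACHIEVED_GOPS / PEAK_SP_GOPS

# Smallest arithmetic intensity at which the memory roof alone
# would still permit the quoted throughput.
min_intensity = ACHIEVED_GOPS / BANDWIDTH_GBS

if __name__ == "__main__":
    print(f"efficiency vs. assumed peak: {efficiency:.1%}")
    print(f"ridge point: {ridge_point(PEAK_SP_GOPS, BANDWIDTH_GBS):.2f} op/byte")
    print(f"intensity needed for 400 GOp/s: {min_intensity:.2f} op/byte")
```

Under these assumed hardware numbers, the quoted 400 GOp/s is consistent with the stated 25% efficiency. It also implies that the mover must sustain at least about two operations per byte of memory traffic, which suggests why keeping particle data in fast on-chip storage matters for this kind of kernel.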