Memory-efficient optimization of Gyrokinetic particle-to-grid interpolation for multicore processors

Authors:
Kamesh Madduri;Samuel Williams;Stéphane Ethier;Leonid Oliker;John Shalf;Erich Strohmaier;Katherine Yelick
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Princeton Plasma Physics Laboratory, Princeton, NJ;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;Lawrence Berkeley National Laboratory, Berkeley, CA;University of California at Berkeley, Berkeley, CA
Venue:
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Year:
2009

Abstract

We present multicore parallelization strategies for the particle-to-grid interpolation step in the Gyrokinetic Toroidal Code (GTC), a 3D particle-in-cell (PIC) application used to study turbulent transport in magnetic-confinement fusion devices. Particle-to-grid interpolation is a known performance bottleneck in several PIC applications. In GTC, this step involves particles depositing charges onto a 3D toroidal mesh, and multiple particles may contribute to the charge at the same grid point. We design new parallel algorithms for the GTC charge deposition kernel and analyze their performance on three leading multicore platforms. We implement thirteen different variants of this kernel and identify the best-performing ones given typical PIC parameters such as the grid size, the number of particles per cell, and the GTC-specific particle Larmor radius variation. We find that our best strategies can be 2x faster than the reference optimized MPI implementation, and our analysis provides insight into desirable architectural features for high-performance PIC simulation codes.
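
To illustrate why several particles contributing to the same grid point makes charge deposition hard to parallelize, the following is a minimal sketch and not the GTC kernel. It assumes a periodic 1D grid with unit spacing and linear weighting to the two nearest grid points, whereas GTC deposits onto a 3D toroidal mesh. The function names `deposit` and `deposit_replicated` are hypothetical. The second function shows grid replication, one common way to avoid write conflicts: each worker deposits into a private copy of the grid, and the copies are summed afterwards. It is shown only as an example strategy, since the abstract does not list the thirteen variants. Python threads are used to show the structure only and would not be expected to give a real speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def deposit(positions, charges, grid):
    """Scatter each particle's charge to its two nearest points on a periodic 1D grid."""
    n = len(grid)
    for x, q in zip(positions, charges):
        base = math.floor(x)
        frac = x - base                  # fractional distance past the left grid point
        left = base % n
        right = (left + 1) % n
        # Several particles may update the same grid point, so these
        # read-modify-write updates would conflict if run concurrently
        # on a single shared grid.
        grid[left] += q * (1.0 - frac)
        grid[right] += q * frac
    return grid


def deposit_replicated(positions, charges, n_grid, n_workers=4):
    """Grid replication: one private grid per worker, followed by a reduction."""
    # Assumption: particles are split round-robin among workers.
    chunks = [(positions[w::n_workers], charges[w::n_workers])
              for w in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        private_grids = list(
            pool.map(lambda c: deposit(c[0], c[1], [0.0] * n_grid), chunks)
        )
    # Reduction: sum the private copies point by point.
    return [sum(column) for column in zip(*private_grids)]


if __name__ == "__main__":
    positions = [0.25, 0.75, 3.5, 0.25]
    charges = [1.0, 1.0, 1.0, 1.0]
    print(deposit(positions, charges, [0.0] * 4))
    print(deposit_replicated(positions, charges, 4, n_workers=2))
```

The trade-off in this sketch is memory: replication needs one grid copy per worker plus a reduction pass, which is presumably the kind of cost that memory-efficient strategies aim to reduce.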