Analysis of scalable data-privatization threading algorithms for hybrid MPI/OpenMP parallelization of molecular dynamics

  • Authors:
  • Manaschai Kunaseth; David F. Richards; James N. Glosli; Rajiv K. Kalia; Aiichiro Nakano; Priya Vashishta

  • Affiliations:
  • Collaboratory for Advanced Computing and Simulations, Department of Computer Science, University of Southern California, Los Angeles, USA 90089; Lawrence Livermore National Laboratory, Livermore, USA 94550

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2013

Abstract

We propose and analyze threading algorithms for hybrid MPI/OpenMP parallelization of a molecular-dynamics simulation that are scalable on large multicore clusters. Two data-privatization thread-scheduling algorithms based on nucleation-growth allocation are introduced: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). The algorithms combine fine-grain dynamic load balancing with minimal-memory-footprint data-privatization threading. We show that the computational costs of CVAS and BFAS are bounded by Θ(n^(5/3) p^(−2/3)) and Θ(n), respectively, for p threads working on n particles on a multicore compute node. The memory consumption per node of both algorithms scales as O(n + n^(2/3) p^(1/3)), but CVAS has a smaller prefactor due to a geometric effect. Based on these analyses, we derive a criterion for selecting between the two algorithms in terms of the granularity n/p. We observe that memory consumption is reduced by 75% for p = 16 and n = 8,192 compared with naïve data privatization, while thread imbalance is kept below 5%. We obtain a strong-scaling speedup of 14.4 with 16-way threading on a node with four quad-core AMD Opteron processors. In addition, our MPI/OpenMP code achieves 2.58× and 2.16× speedups over the MPI-only implementation on 32,768 BlueGene/P cores for 0.84- and 1.68-million-particle systems, respectively.
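
To illustrate what "naïve data privatization" refers to, the following C/OpenMP sketch shows the generic pattern of per-thread private force buffers followed by a reduction. The function compute_forces and the arrays pair_i, pair_j, and fpair are hypothetical stand-ins for a real MD neighbor-list traversal, and only a single scalar force component is accumulated for brevity. This sketch allocates a full n-sized private copy per thread (O(np) memory); the paper's CVAS and BFAS algorithms instead privatize only compact subsets of cells per thread, which is what yields the O(n + n^(2/3) p^(1/3)) memory bound and is not reproduced here.

    /* Hypothetical sketch: naive data-privatization threading for pairwise
     * force accumulation. Each thread scatters into its own private buffer,
     * avoiding atomics, and the buffers are then reduced into force[]. */
    #include <stdlib.h>
    #include <omp.h>

    void compute_forces(int n, int npairs,
                        const int *pair_i, const int *pair_j,
                        const double *fpair, double *force)
    {
        int nthreads = omp_get_max_threads();
        /* Naive privatization: one full n-sized copy per thread, i.e. O(np)
         * memory. The paper's nucleation-growth allocation (CVAS/BFAS)
         * privatizes only the cells each thread works on instead. */
        double *priv = calloc((size_t)nthreads * (size_t)n, sizeof(double));

        #pragma omp parallel
        {
            double *my = priv + (size_t)omp_get_thread_num() * (size_t)n;

            /* Dynamic scheduling over pairs gives fine-grain load balancing. */
            #pragma omp for schedule(dynamic, 64)
            for (int k = 0; k < npairs; ++k) {
                my[pair_i[k]] += fpair[k];   /* no atomics: private copy  */
                my[pair_j[k]] -= fpair[k];   /* Newton's third law        */
            }

            /* Reduction: sum the per-thread copies back into force[]. */
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int t = 0; t < nthreads; ++t)
                    s += priv[(size_t)t * (size_t)n + i];
                force[i] += s;
            }
        }
        free(priv);
    }

The per-thread copies allocated here are the baseline against which the abstract's 75% memory reduction (p = 16, n = 8,192) is reported.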