Analysis of scalable data-privatization threading algorithms for hybrid MPI/OpenMP parallelization of molecular dynamics

  • Authors:
  • Manaschai Kunaseth; David F. Richards; James N. Glosli; Rajiv K. Kalia; Aiichiro Nakano; Priya Vashishta

  • Affiliations:
  • Collaboratory for Advanced Computing and Simulations, Department of Computer Science, University of Southern California, Los Angeles, USA 90089; Lawrence Livermore National Laboratory, Livermore, USA 94550

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2013

Abstract

We propose and analyze threading algorithms for hybrid MPI/OpenMP parallelization of a molecular-dynamics simulation that are scalable on large multicore clusters. Two data-privatization thread-scheduling algorithms based on nucleation-growth allocation are introduced: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). The algorithms combine fine-grain dynamic load balancing with minimal-memory-footprint data-privatization threading. We show that the computational costs of CVAS and BFAS are bounded by Θ(n^(5/3) p^(−2/3)) and Θ(n), respectively, for p threads working on n particles on a multicore compute node. The memory consumption per node of both algorithms scales as O(n + n^(2/3) p^(1/3)), but CVAS has a smaller prefactor due to a geometric effect. Based on these analyses, we derive a criterion for selecting between the two algorithms in terms of the granularity n/p. We observe that memory consumption is reduced by 75% for p = 16 and n = 8,192 compared with naïve data privatization, while thread imbalance is kept below 5%. We obtain a strong-scaling speedup of 14.4 with 16-way threading on a node with four quad-core AMD Opteron processors. In addition, our MPI/OpenMP code achieves 2.58× and 2.16× speedups over the MPI-only implementation on 32,768 BlueGene/P cores for 0.84- and 1.68-million-particle systems, respectively.
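
To illustrate what "naïve data privatization" refers to, the following C/OpenMP sketch shows the generic pattern of per-thread private force buffers followed by a reduction. The function compute_forces and the arrays pair_i, pair_j, and fpair are hypothetical stand-ins for a real MD neighbor-list traversal, and only a single scalar force component is accumulated for brevity. This sketch allocates a full n-sized private copy per thread (O(np) memory); the paper's CVAS and BFAS algorithms instead privatize only compact subsets of cells per thread, which is what yields the O(n + n^(2/3) p^(1/3)) memory bound and is not reproduced here.

    /* Hypothetical sketch: naive data-privatization threading for pairwise
     * force accumulation. Each thread scatters into its own private buffer,
     * avoiding atomics, and the buffers are then reduced into force[]. */
    #include <stdlib.h>
    #include <omp.h>

    void compute_forces(int n, int npairs,
                        const int *pair_i, const int *pair_j,
                        const double *fpair, double *force)
    {
        int nthreads = omp_get_max_threads();
        /* Naive privatization: one full n-sized copy per thread, i.e. O(np)
         * memory. The paper's nucleation-growth allocation (CVAS/BFAS)
         * privatizes only the cells each thread works on instead. */
        double *priv = calloc((size_t)nthreads * (size_t)n, sizeof(double));

        #pragma omp parallel
        {
            double *my = priv + (size_t)omp_get_thread_num() * (size_t)n;

            /* Dynamic scheduling over pairs gives fine-grain load balancing. */
            #pragma omp for schedule(dynamic, 64)
            for (int k = 0; k < npairs; ++k) {
                my[pair_i[k]] += fpair[k];   /* no atomics: private copy  */
                my[pair_j[k]] -= fpair[k];   /* Newton's third law        */
            }

            /* Reduction: sum the per-thread copies back into force[]. */
            #pragma omp for
            for (int i = 0; i < n; ++i) {
                double s = 0.0;
                for (int t = 0; t < nthreads; ++t)
                    s += priv[(size_t)t * (size_t)n + i];
                force[i] += s;
            }
        }
        free(priv);
    }

The per-thread copies allocated here are the baseline against which the abstract's 75% memory reduction (p = 16, n = 8,192) is reported.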