Exploiting hierarchical parallelisms for molecular dynamics simulation on multicore clusters

  • Authors:
  • Liu Peng; Manaschai Kunaseth; Hikmet Dursun; Ken-Ichi Nomura; Weiqiang Wang; Rajiv K. Kalia; Aiichiro Nakano; Priya Vashishta

  • Affiliations:
  • All authors: Collaboratory for Advanced Computing and Simulations (CACS), University of Southern California, Los Angeles, CA 90089-0242, USA

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2011

Abstract

We have developed a scalable hierarchical parallelization scheme for molecular dynamics (MD) simulation on multicore clusters. The scheme exploits multilevel parallelism by combining: (1) internode parallelism using spatial decomposition via message passing; (2) intercore parallelism using cellular decomposition via multithreading based on a master/worker model; and (3) data-level optimization via single-instruction multiple-data (SIMD) parallelism using various code transformation techniques. By using a hierarchy of parallelisms, the scheme exposes very high concurrency and data locality, thereby achieving: (1) an internode weak-scaling parallel efficiency of 0.985 on 106,496 BlueGene/L nodes (0.975 on 32,768 BlueGene/P nodes) and an internode strong-scaling parallel efficiency of 0.90 on 8,192 BlueGene/L nodes; (2) an intercore multithreading parallel efficiency of 0.65 for eight threads on a dual quad-core Xeon platform; and (3) a SIMD speedup of approximately 2 for problem sizes ranging from 3,072 to 98,304 atoms. Furthermore, the effect of the memory-access penalty on SIMD performance is analyzed, and an application-based SIMD analysis scheme is proposed to help programmers determine whether their applications are amenable to SIMDization.
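
The abstract describes a three-level hierarchy: MPI ranks across nodes (spatial decomposition), threads across cores (cellular decomposition), and SIMD within each core. The sketch below is a minimal, illustrative rendering of that structure in C with MPI and OpenMP; it is not the authors' code. Names such as Cell and compute_cell_forces, the cell count, and the Lennard-Jones-like force expression are assumptions made for illustration, the OpenMP dynamic schedule stands in for the master/worker thread model, and the omp simd pragma stands in for the hand-applied SIMD code transformations described in the paper.

/* Illustrative sketch of the hierarchical parallelization described in the
 * abstract (not the authors' code):
 *   level 1: MPI ranks own spatial subdomains (internode, message passing)
 *   level 2: threads process cells within a subdomain (intercore)
 *   level 3: SIMD inside the per-cell pair loop (data level)            */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

typedef struct {
    int     n;                /* number of atoms in this cell              */
    double *x, *y, *z;        /* coordinates, structure-of-arrays layout   */
    double *fx, *fy, *fz;     /* forces, same layout for stride-one access */
} Cell;

/* Pairwise Lennard-Jones-like forces within one cell; the dependence-free
 * inner loop is the part amenable to SIMD vectorization.                 */
static void compute_cell_forces(Cell *c)
{
    for (int i = 0; i < c->n; ++i) {
        double fxi = 0.0, fyi = 0.0, fzi = 0.0;
        #pragma omp simd reduction(+:fxi,fyi,fzi)
        for (int j = 0; j < c->n; ++j) {
            if (j != i) {
                double dx = c->x[i] - c->x[j];
                double dy = c->y[i] - c->y[j];
                double dz = c->z[i] - c->z[j];
                double r2 = dx*dx + dy*dy + dz*dz + 1e-12;
                double inv6 = 1.0 / (r2*r2*r2);
                double f = (48.0*inv6 - 24.0) * inv6 / r2;
                fxi += f*dx; fyi += f*dy; fzi += f*dz;
            }
        }
        c->fx[i] += fxi; c->fy[i] += fyi; c->fz[i] += fzi;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);            /* level 1: one spatial subdomain per rank */
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    (void)rank; (void)nprocs;          /* unused in this skeleton */

    /* Each rank's subdomain is further divided into cells; here we only
     * allocate empty placeholder cells.                                  */
    int ncells = 64;
    Cell *cells = calloc(ncells, sizeof(Cell));

    /* level 2: cells farmed out to threads (dynamic schedule approximates
     * the master/worker model).                                          */
    #pragma omp parallel for schedule(dynamic)
    for (int c = 0; c < ncells; ++c)
        compute_cell_forces(&cells[c]); /* level 3: SIMD inside each cell */

    /* Halo exchange of boundary atoms between neighboring subdomains
     * (e.g. via MPI_Sendrecv) would go here in a real MD code.           */
    MPI_Barrier(MPI_COMM_WORLD);
    free(cells);
    MPI_Finalize();
    return 0;
}

The structure-of-arrays layout is what keeps the inner loop stride-one and therefore vectorizable; with an array-of-structures layout, the same loop would incur the kind of memory-access penalty whose effect on SIMD performance the abstract says is analyzed in the paper.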