Performance analysis and optimization of molecular dynamics simulation on Godson-T many-core processor

Authors:
Liu Peng;Aiichiro Nakano;Guangming Tan;Priya Vashishta;Dongrui Fan;Hao Zhang;Rajiv K. Kalia;Fenglong Song
Affiliations:
Chinese Academy of Sciences, Beijing, China and University of Southern California, Los Angeles, CA;University of Southern California, Los Angeles, CA;Chinese Academy of Sciences, Beijing, China;University of Southern California, Los Angeles, CA;Chinese Academy of Sciences, Beijing, China;Chinese Academy of Sciences, Beijing, China;University of Southern California, Los Angeles, CA;Chinese Academy of Sciences, Beijing, China
Venue:
Proceedings of the 8th ACM International Conference on Computing Frontiers
Year:
2011

Abstract

Molecular dynamics (MD) simulation has broad applications, but its irregular memory-access pattern makes performance optimization a challenge. This paper presents a joint application/architecture study to enhance the on-chip parallelism of MD on a Godson-T-like many-core architecture. First, a preprocessing step that leverages an adaptive divide-and-conquer framework is designed to exploit locality through a memory hierarchy with software-controlled memory. We then propose three incremental optimization strategies: (1) a novel data layout that reorganizes linked-list cell data structures to improve data locality; (2) an on-chip locality-aware parallel algorithm to enhance data reuse; and (3) a pipelining algorithm to hide the latency of shared-memory accesses. Experiments on a Godson-T simulator exhibit a strong-scaling parallel efficiency of 0.99 on 64 cores, which is confirmed by an FPGA emulator. Detailed analysis shows that the optimizations that use architectural features to maximize data locality and enhance data reuse benefit scalability the most. Furthermore, a simple performance model suggests that the optimization scheme is likely to scale well toward exascale. Certain architectural features are found to be essential for these optimizations, which could guide future hardware development.
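
The abstract does not spell out the data layout itself, so the following is only a minimal sketch of the general idea behind strategy (1), not the authors' implementation. It contrasts the classic linked-list cell structure, where atoms of one cell are scattered through memory and reached by pointer chasing, with a cell-sorted layout in which each cell owns one contiguous slice. All names (`cell_index`, `build_linked_list`, `atoms_in_cell`, `build_contiguous_layout`) and the cubic-box assumption are hypothetical and chosen for illustration.

```python
import math

EMPTY = -1  # sentinel marking the end of a cell's linked list


def cell_index(p, box, nc):
    """Map a position to a scalar cell index (assumes a cubic box, nc cells per side)."""
    i, j, k = (math.floor(x / box * nc) % nc for x in p)
    return (i * nc + j) * nc + k


def build_linked_list(cells, n_cells):
    """Classic linked-list cells: head[c] is one atom in cell c, nxt[i] the next atom in that cell."""
    head = [EMPTY] * n_cells
    nxt = [EMPTY] * len(cells)
    for i, c in enumerate(cells):
        nxt[i] = head[c]  # push atom i onto the front of cell c's list
        head[c] = i
    return head, nxt


def atoms_in_cell(head, nxt, c):
    """Walk the linked list of cell c (one indirect memory access per atom)."""
    out = []
    i = head[c]
    while i != EMPTY:
        out.append(i)
        i = nxt[i]
    return out


def build_contiguous_layout(cells, n_cells):
    """Cell-sorted layout: atoms of cell c are order[start[c]:start[c + 1]]."""
    counts = [0] * n_cells
    for c in cells:
        counts[c] += 1
    start = [0] * (n_cells + 1)
    for c in range(n_cells):
        start[c + 1] = start[c] + counts[c]  # prefix sum of per-cell counts
    order = sorted(range(len(cells)), key=lambda i: cells[i])  # sorted() is stable
    return order, start


positions = [(0.5, 0.5, 0.5), (3.5, 3.5, 3.5), (0.6, 0.4, 0.2), (2.5, 0.5, 0.5)]
cells = [cell_index(p, box=4.0, nc=2) for p in positions]
head, nxt = build_linked_list(cells, n_cells=8)
order, start = build_contiguous_layout(cells, n_cells=8)
print(atoms_in_cell(head, nxt, 0))   # atoms of cell 0 via pointer chasing
print(order[start[0]:start[1]])      # atoms of cell 0 as one contiguous slice
```

In the sorted layout, a core that processes a cell can fetch its atoms as a single block, which is the kind of access pattern that software-controlled on-chip memory tends to favor. Whether the paper's layout takes exactly this form is not stated in the abstract.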