Achieving good scaling for fine-grained, communication-intensive applications on modern supercomputers remains challenging. In previous work, we showed that one such application, NAMD, scales well on the full Jaguar XT5 when long-range interactions are excluded; with them included, however, the speedup falters beyond 64K cores. Although the new Gemini interconnect on the Cray XK6 has improved network performance, the challenges remain, and are likely to remain for other such networks as well. We analyze communication bottlenecks in NAMD and its CHARM++ runtime system using the Projections performance analysis tool. Based on this analysis, we optimize the runtime, which is built on the uGNI library for Gemini, and present several techniques to improve fine-grained communication. As a result, the performance of the 92,224-atom ApoA1 benchmark with GPUs on TitanDev improves by 36%. For the 100-million-atom STMV benchmark, we improve on the prior Jaguar XT5 result of 26 ms/step, reaching 13 ms/step on 298,992 cores of Jaguar XK6.
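The abstract does not enumerate the individual techniques. As one illustrative example of how fine-grained communication overhead is commonly reduced, the sketch below shows message coalescing: many small messages bound for the same destination are buffered and sent as one combined message, amortizing per-message network cost. This is a minimal, hypothetical sketch (the `Coalescer` class, `send_fn` callback, and `threshold` parameter are invented for illustration), not the Charm++/uGNI implementation described in the paper:

```python
from collections import defaultdict


class Coalescer:
    """Hypothetical sketch of message coalescing: buffer small payloads
    per destination and flush them as one combined message once a byte
    threshold is reached, amortizing per-message network overhead."""

    def __init__(self, send_fn, threshold=4096):
        self.send_fn = send_fn            # callback: send_fn(dest, payload_bytes)
        self.threshold = threshold        # flush once this many bytes are buffered
        self.buffers = defaultdict(list)  # dest -> list of pending payloads
        self.sizes = defaultdict(int)     # dest -> total buffered bytes

    def send(self, dest, payload: bytes):
        """Enqueue a small message; flush automatically at the threshold."""
        self.buffers[dest].append(payload)
        self.sizes[dest] += len(payload)
        if self.sizes[dest] >= self.threshold:
            self.flush(dest)

    def flush(self, dest):
        """Emit all buffered payloads for `dest` as one combined message."""
        if self.buffers[dest]:
            self.send_fn(dest, b"".join(self.buffers[dest]))
            self.buffers[dest].clear()
            self.sizes[dest] = 0

    def flush_all(self):
        """Drain every destination, e.g. at the end of a timestep."""
        for dest in list(self.buffers):
            self.flush(dest)
```

For example, with a threshold of 8 bytes, four 3-byte messages to one destination trigger only two actual network sends (one automatic flush at 9 buffered bytes, one final `flush_all`) instead of four. The trade-off is added latency for buffered messages, which is why a real runtime would also flush on idle or on a timer.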