Achieving good scaling for fine-grained, communication-intensive applications on modern supercomputers remains challenging. In previous work, we showed that one such application, NAMD, scales well on the full Jaguar XT5 when long-range interactions are excluded; with them included, however, the speedup falters beyond 64K cores. Although the new Gemini interconnect on the Cray XK6 has improved network performance, the challenges remain, and are likely to remain for other such networks as well. We analyze communication bottlenecks in NAMD and its CHARM++ runtime system using the Projections performance analysis tool. Based on this analysis, we optimize the runtime, which is built on the uGNI library for Gemini, and present several techniques to improve fine-grained communication. As a result, the performance of the 92,224-atom ApoA1 benchmark with GPUs on TitanDev improves by 36%. For the 100-million-atom STMV benchmark, we improve on the prior Jaguar XT5 result of 26 ms/step, reaching 13 ms/step on 298,992 cores of Jaguar XK6.
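The abstract does not enumerate the individual techniques. As one illustrative example of how fine-grained communication overhead is commonly reduced, the sketch below shows message coalescing: many small messages bound for the same destination are buffered and sent as one combined message, amortizing per-message network cost. This is a minimal, hypothetical sketch (the `Coalescer` class, `send_fn` callback, and `threshold` parameter are invented for illustration), not the Charm++/uGNI implementation described in the paper:

```python
from collections import defaultdict


class Coalescer:
    """Hypothetical sketch of message coalescing: buffer small payloads
    per destination and flush them as one combined message once a byte
    threshold is reached, amortizing per-message network overhead."""

    def __init__(self, send_fn, threshold=4096):
        self.send_fn = send_fn            # callback: send_fn(dest, payload_bytes)
        self.threshold = threshold        # flush once this many bytes are buffered
        self.buffers = defaultdict(list)  # dest -> list of pending payloads
        self.sizes = defaultdict(int)     # dest -> total buffered bytes

    def send(self, dest, payload: bytes):
        """Enqueue a small message; flush automatically at the threshold."""
        self.buffers[dest].append(payload)
        self.sizes[dest] += len(payload)
        if self.sizes[dest] >= self.threshold:
            self.flush(dest)

    def flush(self, dest):
        """Emit all buffered payloads for `dest` as one combined message."""
        if self.buffers[dest]:
            self.send_fn(dest, b"".join(self.buffers[dest]))
            self.buffers[dest].clear()
            self.sizes[dest] = 0

    def flush_all(self):
        """Drain every destination, e.g. at the end of a timestep."""
        for dest in list(self.buffers):
            self.flush(dest)
```

For example, with a threshold of 8 bytes, four 3-byte messages to one destination trigger only two actual network sends (one automatic flush at 9 buffered bytes, one final `flush_all`) instead of four. The trade-off is added latency for buffered messages, which is why a real runtime would also flush on idle or on a timer.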