Optimizing a parallel runtime system for multicore clusters: a case study

Authors:
Chao Mei;Gengbin Zheng;Filippo Gioachin;Laxmikant V. Kalé
Affiliations:
University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign;University of Illinois at Urbana-Champaign
Venue:
Proceedings of the 2010 TeraGrid Conference
Year:
2010

Abstract

Clusters of multicore nodes have become the most popular option for new HPC systems due to their scalability and performance/cost ratio. The complexity of programming multicore systems underscores the need for powerful and efficient runtime systems that manage resources such as threads and communication subsystems on behalf of applications. In this paper, we study several multicore performance issues on clusters using Intel, AMD, and IBM processors in the context of the Charm++ runtime system, and we present optimization techniques that overcome these issues. The techniques are general enough to apply to other runtime systems as well. We demonstrate the benefits of these optimizations through both synthetic benchmarks and production-quality applications, including NAMD and ChaNGa, on several popular multicore platforms; the performance of NAMD and ChaNGa improves by about 20% and 10%, respectively.