Achieving high performance on extremely large parallel machines: performance prediction and load balancing

  • Authors:
  • Laxmikant V. Kale; Gengbin Zheng

  • Affiliations:
  • University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign

  • Venue:
  • Doctoral thesis, University of Illinois at Urbana-Champaign
  • Year:
  • 2005

Abstract

Parallel machines with an extremely large number of processors are now in operation; for example, the IBM BlueGene/L machine with 128K processors is currently being deployed. Writing parallel programs that exploit this enormous compute power, and manually scaling applications to such machines, will be a significant challenge for application developers. Addressing it requires suitable parallel programming models as well as solutions to issues such as load imbalance. This thesis explores processor virtualization with migratable objects in the Charm++ programming model for programming petaflops-class machines, supported by parallel emulation for algorithm validation, parallel simulation for performance prediction, and new kinds of automatic load balancing strategies, to substantially address many of these challenges. Understanding the performance of parallel applications on very large machines is essential. This thesis explores Parallel Discrete Event Simulation techniques to simulate parallel applications and predict their performance, and presents a novel optimistic synchronization protocol that exploits the inherent determinacy of parallel applications to effectively reduce synchronization overhead. The load balancing problem poses significant challenges for applications seeking scalability on very large machines. We study load balancing techniques and develop a spectrum of strategies motivated by several real-world applications, optimizing them along multiple criteria, including communication-aware load balancing, sub-step load balancing, and computation phase-aware load balancing. Using the load balancing framework presented in this thesis, we have scaled NAMD (a classical molecular dynamics application) to 1 TF of peak performance on 3000 processors of PSC Lemieux. We further motivate the need for next-generation load balancing strategies for petaflops-class machines and explore a novel design for a scalable hierarchical load balancing scheme that incorporates an explicit memory cost control function, making it easy to adapt to extremely large machines with a small memory footprint. This hierarchical scheme builds its load database by automatically instrumenting the application at run time for both computation load and communication patterns, and the resulting strategy takes the application's communication pattern into account explicitly.
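
To make the communication-aware load balancing idea concrete, the following minimal, self-contained C++ sketch shows a greedy object-to-processor mapping that weighs measured compute load against a penalty for communication with objects already placed on other processors. It only illustrates the general flavor of such strategies; the Task structure, the commPenalty weight, and greedyCommAwareMap are hypothetical names and do not reproduce the actual Charm++ load balancing strategies developed in the thesis.

    // Illustrative greedy, communication-aware load balancing sketch.
    // Hypothetical types and weights; not the thesis's Charm++ strategy.
    #include <algorithm>
    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Task {
        int id;
        double load;                               // measured compute load of the object
        std::vector<std::pair<int, double>> comm;  // (peer task id, bytes exchanged)
    };

    // Assign each task to the processor with the lowest effective cost: current
    // processor load plus a penalty for bytes exchanged with peers placed elsewhere.
    std::vector<int> greedyCommAwareMap(const std::vector<Task>& tasks,
                                        int numProcs, double commPenalty) {
        std::vector<int> assignment(tasks.size(), -1);
        std::vector<double> procLoad(numProcs, 0.0);

        // Place the heaviest tasks first so large objects land on the emptiest processors.
        std::vector<int> order(tasks.size());
        for (size_t i = 0; i < tasks.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(), [&](int a, int b) {
            return tasks[a].load > tasks[b].load;
        });

        for (int t : order) {
            int bestProc = 0;
            double bestCost = 1e300;
            for (int p = 0; p < numProcs; ++p) {
                double remoteBytes = 0.0;
                for (const auto& [peer, bytes] : tasks[t].comm) {
                    if (assignment[peer] != -1 && assignment[peer] != p)
                        remoteBytes += bytes;
                }
                double cost = procLoad[p] + tasks[t].load + commPenalty * remoteBytes;
                if (cost < bestCost) { bestCost = cost; bestProc = p; }
            }
            assignment[t] = bestProc;
            procLoad[bestProc] += tasks[t].load;
        }
        return assignment;
    }

    int main() {
        std::vector<Task> tasks = {
            {0, 4.0, {{1, 100.0}}}, {1, 3.0, {{0, 100.0}}},
            {2, 2.5, {}},           {3, 1.0, {{2, 50.0}}}};
        std::vector<int> map = greedyCommAwareMap(tasks, 2, 0.01);
        for (size_t i = 0; i < map.size(); ++i)
            std::printf("task %zu -> processor %d\n", i, map[i]);
        return 0;
    }

A real strategy of this kind would consume the load and communication data gathered by run-time instrumentation, as the abstract describes, rather than the hard-coded inputs used in main().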