Parallel machines with an extremely large number of processors are now in operation. For example, the IBM BlueGene/L machine with 128K processors is currently being deployed. It will be a significant challenge for application developers to write parallel programs that exploit the enormous compute power available, and to manually scale their applications on such machines. Solving these problems involves finding suitable parallel programming models for such machines and addressing issues such as load imbalance. This thesis explores processor virtualization in the Charm++ programming model, employing migratable objects for programming petaflops-class machines, supported by parallel emulation for algorithm validation, parallel simulation for performance prediction, and new kinds of automatic load balancing strategies that substantially address many of the challenges of programming very large machines.

It is important to understand the performance of parallel applications on very large parallel machines. This thesis explores Parallel Discrete Event Simulation techniques to simulate parallel applications and predict their performance. We present a novel optimistic synchronization protocol that exploits the inherent determinacy in parallel applications to effectively reduce synchronization overhead.

The load imbalance problem presents significant challenges for applications seeking scalability on very large machines. We study load balancing techniques and develop a spectrum of load balancing strategies motivated by several real-world applications. We optimize our load balancing strategies along multiple dimensions, including communication-aware load balancing, sub-step load balancing, and computation phase-aware load balancing. Using the load balancing framework presented in this thesis, we have successfully scaled NAMD (a classical molecular dynamics application) to 1 TF of peak performance on 3,000 processors of PSC Lemieux.
We further motivate the need for next-generation load balancing strategies for petaflops-class machines. We explore a novel design for a scalable hierarchical load balancing scheme that incorporates an explicit memory cost control function, making it easy to adapt to extremely large machines with a small memory footprint. The scheme builds load data by automatically instrumenting an application at run time, capturing both its computation load and its communication pattern, and the load balancing strategy takes the application's communication pattern into account explicitly.