CHARM++: a portable concurrent object oriented system based on C++
OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
Scheduling multithreaded computations by work stealing
Journal of the ACM (JACM)
Zoltan Data Management Service for Parallel Dynamic Applications
Computing in Science and Engineering
HPCN Europe 1996 Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking
lmbench: portable tools for performance analysis
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
The Design and Implementation of a Domain-Specific Language for Network Performance Testing
IEEE Transactions on Parallel and Distributed Systems
Dynamic topology aware load balancing algorithms for molecular dynamics applications
Proceedings of the 23rd international conference on Supercomputing
Towards an Efficient Process Placement Policy for MPI Applications in Multicore Environments
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications
PDP '10 Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing
Handling the problems and opportunities posed by multiple on-chip memory controllers
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Near-optimal placement of MPI processes on hierarchical NUMA architectures
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Scheduling task parallelism on multi-socket multicore systems
Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers
Proceedings of the international symposium on Memory management
Generic topology mapping strategies for large-scale parallel architectures
Proceedings of the international conference on Supercomputing
Periodic hierarchical load balancing for large supercomputers
International Journal of High Performance Computing Applications
Work stealing and persistence-based load balancers for iterative overdecomposed applications
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
NUMA-aware graph mining techniques for performance and energy efficiency
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Asymptotically Optimal Load Balancing for Hierarchical Multi-Core Systems
ICPADS '12 Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems
Hi-index | 0.00 |
In this paper, we present a topology-aware load balancing algorithm for parallel multi-core machines and its proof of asymptotic convergence to an optimal solution. The algorithm, named HwTopoLB, aims to improve the application performance by reducing core idleness and communication delays. HwTopoLB was designed taking into account the properties of current parallel systems composed of multi-core compute nodes, namely their network interconnection, and their complex and hierarchical core topology. The latter comprises multiple levels of cache, and a memory subsystem with NUMA design. These systems provide high processing power at the expense of asymmetric communication costs, which can hamper the performance of parallel applications depending on their communication patterns if ignored. Our load balancing algorithm models asymmetries in terms of latencies and bandwidths, representing the distances and communication costs among hardware components. We have implemented HwTopoLB using the Charm++ Parallel Runtime System and evaluated its performance with two different benchmarks and one application. Our experimental results with HwTopoLB exhibit scalability over clustered multi-core compute nodes, and average performance improvements of 23% over execution without load balancers and 19% over the existing load balancing strategies on different multi-core systems.