A topology-aware load balancing algorithm for clustered hierarchical multi-core machines

Authors:
Laércio L. Pilla;Christiane P. Ribeiro;Pierre Coucheney;François Broquedis;Bruno Gaujal;Philippe O. A. Navaux;Jean-François Méhaut
Venue:
Future Generation Computer Systems
Year:
2014

Abstract

We present HwTopoLB, a topology-aware load balancing algorithm for parallel multi-core machines, together with a proof that it converges asymptotically to an optimal solution. HwTopoLB aims to improve application performance by reducing core idleness and communication delays. It was designed around the properties of current parallel systems built from multi-core compute nodes, namely their network interconnection and their complex, hierarchical core topology. This topology comprises multiple levels of cache and a memory subsystem with a NUMA design. Such systems provide high processing power at the expense of asymmetric communication costs. If these asymmetries are ignored, they can hamper the performance of parallel applications, depending on their communication patterns. Our load balancing algorithm models the asymmetries in terms of latencies and bandwidths, which represent the distances and communication costs among hardware components. We implemented HwTopoLB in the Charm++ Parallel Runtime System and evaluated its performance with two benchmarks and one application. Our experimental results show that HwTopoLB scales over clustered multi-core compute nodes and achieves average performance improvements of 23% over execution without load balancers and 19% over existing load balancing strategies on different multi-core systems.
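
To make the idea of modeling asymmetries through latencies and bandwidths concrete, the following is a minimal illustrative sketch. It is not HwTopoLB itself: the greedy rule, the cost formula (latency plus message size divided by bandwidth), and all names (`Link`, `comm_cost`, `greedy_place`) and numbers are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Link:
    """Hypothetical description of the path between two cores."""
    latency_s: float       # latency in seconds
    bandwidth_bps: float   # bandwidth in bytes per second


def comm_cost(link, n_bytes):
    """Assumed cost model: time = latency + size / bandwidth."""
    return link.latency_s + n_bytes / link.bandwidth_bps


def greedy_place(task_loads, traffic, links, n_cores):
    """Place heavier tasks first, each on the core that minimizes
    core load plus the cost of talking to already placed peers."""
    core_load = [0.0] * n_cores
    placement = {}
    for task in sorted(task_loads, key=task_loads.get, reverse=True):
        best_core, best_cost = None, float("inf")
        for core in range(n_cores):
            cost = core_load[core] + task_loads[task]
            for (a, b), n_bytes in traffic.items():
                if task not in (a, b):
                    continue
                peer = b if a == task else a
                # Tasks on the same core are assumed to communicate for free.
                if peer in placement and placement[peer] != core:
                    link = links[frozenset((core, placement[peer]))]
                    cost += comm_cost(link, n_bytes)
            if cost < best_cost:
                best_core, best_cost = core, cost
        placement[task] = best_core
        core_load[best_core] += task_loads[task]
    return placement


# Toy machine (made-up values): cores {0, 1} and {2, 3} each share a cache;
# links that cross the two groups are slower.
fast = Link(latency_s=1e-7, bandwidth_bps=1e10)
slow = Link(latency_s=1e-6, bandwidth_bps=1e9)
links = {
    frozenset(pair): fast if pair in ((0, 1), (2, 3)) else slow
    for pair in combinations(range(4), 2)
}

task_loads = {"a": 1.0, "b": 1.0}    # compute time per task, in seconds
traffic = {("a", "b"): 1_000_000}    # bytes exchanged between a and b
placement = greedy_place(task_loads, traffic, links, n_cores=4)
```

In this toy setup the two communicating tasks end up on different cores that share a cache, since that placement avoids both overloading one core and paying the slower cross-group link.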