hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications

Authors:
François Broquedis; Jérôme Clet-Ortega; Stéphanie Moreaud; Nathalie Furmento; Brice Goglin; Guillaume Mercier; Samuel Thibault; Raymond Namyst
Venue:
PDP '10 Proceedings of the 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing
Year:
2010

Abstract

The increasing number of cores, shared caches and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior according to the underlying hierarchy of hardware resources and their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly help performance by having runtime systems place their tasks or adapt their communication strategies depending on hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP or MPI software. Indeed, scheduling OpenMP threads according to their affinities or placing MPI processes according to their communication patterns shows interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be dynamically chosen according to the location of the communicating processes in the machine and its hardware characteristics.
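
The idea of choosing a communication strategy from the location of two processes in a hierarchical topology can be illustrated with a small toy model. The sketch below does not use hwloc or its API: the `Node` class, the machine layout (2 memory nodes, each with 2 shared caches, each with 2 cores) and the strategy names are all hypothetical, chosen only to show how a tree of hardware objects lets a runtime find the deepest resource that two cores share.

```python
class Node:
    """One object in a toy hardware tree (hypothetical, not an hwloc type)."""

    def __init__(self, kind, index, parent=None):
        self.kind = kind        # "machine", "numa", "cache" or "core"
        self.index = index
        self.parent = parent
        self.children = []

    def add(self, kind, index):
        """Create a child object and attach it below this one."""
        child = Node(kind, index, parent=self)
        self.children.append(child)
        return child


def build_toy_machine():
    """Assumed layout: 2 memory nodes x 2 shared caches x 2 cores."""
    machine = Node("machine", 0)
    cores = []
    for n in range(2):
        numa = machine.add("numa", n)
        for c in range(2):
            cache = numa.add("cache", 2 * n + c)
            for _ in range(2):
                cores.append(cache.add("core", len(cores)))
    return cores


def ancestors(node):
    """Chain of objects from `node` up to the root, `node` first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def common_ancestor(a, b):
    """Deepest object in the tree that contains both `a` and `b`."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n
    raise ValueError("objects belong to different trees")


def pick_strategy(a, b):
    """Map the shared resource to an illustrative strategy label."""
    labels = {
        "core": "same core",
        "cache": "shared cache",
        "numa": "same memory node",
    }
    return labels.get(common_ancestor(a, b).kind, "across memory nodes")


cores = build_toy_machine()
print(pick_strategy(cores[0], cores[1]))
print(pick_strategy(cores[0], cores[2]))
print(pick_strategy(cores[0], cores[4]))
```

In this model, cores 0 and 1 sit under the same cache, cores 0 and 2 share only a memory node, and cores 0 and 4 share only the machine, so a runtime could select a different transfer method in each case.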