Near-optimal placement of MPI processes on hierarchical NUMA architectures
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Adaptive MPI multirail tuning for non-uniform input/output access
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Locality and topology aware intra-node communication among multicore CPUs
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Enabling locality-aware computations in OpenMP
Scientific Programming - Exploring Languages for Expressing Medium to Massive On-Chip Parallelism
Parallel memory prediction for fused linear algebra kernels
ACM SIGMETRICS Performance Evaluation Review - Special issue on the 1st international workshop on performance modeling, benchmarking and simulation of high performance computing systems (PMBS 10)
The TheLMA project: Multi-GPU implementation of the lattice Boltzmann method
International Journal of High Performance Computing Applications
Towards NUMA support with distance information
IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Improving MPI applications performance on multicore clusters with rank reordering
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Multi-core and network aware MPI topology functions
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Adaptive parallel approximate similarity search for responsive multimedia retrieval
Proceedings of the 20th ACM international conference on Information and knowledge management
DAGuE: A generic distributed DAG engine for High Performance Computing
Parallel Computing
Computers and Electrical Engineering
Automatic NUMA characterization using Cbench
ICPE '12 Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering
Integrated in-system storage architecture for high performance computing
Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers
The design of OpenMP thread affinity
IWOMP'12 Proceedings of the 8th international conference on OpenMP in a Heterogeneous World
Characterizing and mitigating work time inflation in task parallel programs
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dynamic thread mapping based on machine learning for transactional memory applications
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Delegation-Based MPI communications for a hybrid parallel computer with many-core architecture
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
KNEM: A generic and scalable kernel-assisted intra-node MPI communication framework
Journal of Parallel and Distributed Computing
NUMA-aware shared-memory collective communication for MPI
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing
Improving performance of openSHMEM reference library by portable PE mapping technique
Proceedings of the 27th ACM International Conference on Supercomputing
Advancing application process affinity experimentation: open MPI's LAMA-based affinity interface
Proceedings of the 20th European MPI Users' Group Meeting
An implementation of the codelet model
Euro-Par'13 Proceedings of the 19th international conference on Parallel Processing
ACM SIGAda Ada Letters
ACM Transactions on Architecture and Code Optimization (TACO)
A topology-aware load balancing algorithm for clustered hierarchical multi-core machines
Future Generation Computer Systems
The Servet 3.0 benchmark suite: Characterization of network performance degradation
Computers and Electrical Engineering
Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications
International Journal of Parallel Programming
Characterizing and mitigating work time inflation in task parallel programs
Scientific Programming - Selected Papers from Super Computing 2012
Scientific Programming - A New Overview of the Trilinos Project --Part 1
The increasing number of cores, shared caches and memory nodes within machines introduces a complex hardware topology. High-performance computing applications now have to carefully adapt their placement and behavior to the underlying hierarchy of hardware resources and to their software affinities. We introduce the Hardware Locality (hwloc) software, which gathers hardware information about processors, caches, memory nodes and more, and exposes it to applications and runtime systems in an abstracted and portable hierarchical manner. hwloc may significantly help performance by letting runtime systems place their tasks or adapt their communication strategies depending on hardware affinities. We show that hwloc can already be used by popular high-performance OpenMP and MPI software. Indeed, scheduling OpenMP threads according to their affinities, or placing MPI processes according to their communication patterns, shows interesting performance improvements thanks to hwloc. An optimized MPI communication strategy may also be chosen dynamically according to the location of the communicating processes in the machine and its hardware characteristics.
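The idea of choosing a communication strategy from the location of two processes in a hierarchical topology can be illustrated with a minimal sketch. This is a toy model in Python, not the hwloc API: the class `TopoObject`, the helpers `build_toy_machine`, `common_ancestor` and `pick_strategy`, the machine layout (2 memory nodes, 2 shared caches each, 2 cores per cache) and the strategy labels are all hypothetical and exist only for illustration.

```python
class TopoObject:
    """Hypothetical node in a hardware topology tree (not an hwloc type)."""

    def __init__(self, kind, index, parent=None):
        self.kind = kind
        self.index = index
        self.parent = parent
        self.children = []

    def add_child(self, kind, index):
        child = TopoObject(kind, index, parent=self)
        self.children.append(child)
        return child

    def ancestors(self):
        """Yield this object, then each enclosing object up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent


def build_toy_machine():
    """Assumed layout: 2 NUMA nodes x 2 shared caches x 2 cores."""
    machine = TopoObject("Machine", 0)
    cores = []
    for n in range(2):
        numa = machine.add_child("NUMANode", n)
        for c in range(2):
            cache = numa.add_child("SharedCache", n * 2 + c)
            for _ in range(2):
                cores.append(cache.add_child("Core", len(cores)))
    return machine, cores


def common_ancestor(a, b):
    """Return the deepest object that contains both a and b."""
    seen = {id(x) for x in a.ancestors()}
    for node in b.ancestors():
        if id(node) in seen:
            return node
    raise ValueError("objects belong to different trees")


# Illustrative labels only; a real MPI library defines its own strategies.
STRATEGY_BY_LEVEL = {
    "Core": "same-core",
    "SharedCache": "shared-cache copy",
    "NUMANode": "same-memory-node copy",
    "Machine": "cross-memory-node copy",
}


def pick_strategy(a, b):
    """Choose a strategy label from the level at which a and b meet."""
    return STRATEGY_BY_LEVEL[common_ancestor(a, b).kind]


if __name__ == "__main__":
    _, cores = build_toy_machine()
    for i, j in [(0, 1), (0, 2), (0, 4)]:
        print(f"core {i} <-> core {j}: {pick_strategy(cores[i], cores[j])}")
```

The deeper the common ancestor, the closer the two cores are in the hierarchy, so a runtime could use that level to decide where to place communicating tasks or which transfer method to use.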