Impact of NUMA effects on high-speed networking with multi-Opteron machines

Authors:
Stéphanie Moreaud; Brice Goglin
Affiliations:
INRIA -- LaBRI -- Université Bordeaux -- France; INRIA -- LaBRI -- Université Bordeaux -- France
Venue:
PDCS '07 Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems
Year:
2007

Abstract

The ever-growing level of parallelism within multi-core and multi-processor cluster nodes is making distributed memory banks and buses with non-uniform access costs commonplace. These NUMA effects have mostly been studied in the context of thread scheduling and are known to influence high-performance networking in clusters. We present an evaluation of their impact on communication performance in multi-Opteron machines. NUMA effects exhibit a strong and asymmetric impact on high-bandwidth communications, while their impact on latency remains low. We then describe the implementation of an automatic NUMA-aware placement strategy that achieves communication performance as good as careful manual placement. By gathering hardware topology information and placing communicating tasks accordingly, it ensures performance portability.
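
The idea of gathering topology information and placing a communicating task accordingly can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not detail. It assumes a Linux system that exposes NIC locality through sysfs (`/sys/class/net/<iface>/device/numa_node`) and per-node CPU lists (`/sys/devices/system/node/node<N>/cpulist`). The function names `parse_cpulist`, `cpus_near_interface`, and `bind_near_interface` are hypothetical, and the `sysfs_root` parameter exists only so the lookup can be exercised against a fake directory tree.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_near_interface(iface, sysfs_root="/sys"):
    """Return the CPUs of the NUMA node the NIC is attached to, or None if unknown."""
    root = Path(sysfs_root)
    node_file = root / "class" / "net" / iface / "device" / "numa_node"
    try:
        node = int(node_file.read_text().strip())
    except (OSError, ValueError):
        return None
    if node < 0:
        # The kernel reports -1 when it has no locality information.
        return None
    cpulist_file = root / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    try:
        return parse_cpulist(cpulist_file.read_text())
    except (OSError, ValueError):
        return None


def bind_near_interface(iface, sysfs_root="/sys"):
    """Pin the calling process to the CPUs closest to the NIC (Linux only).

    Returns True if the affinity was changed, False if locality is unknown.
    """
    cpus = cpus_near_interface(iface, sysfs_root)
    if not cpus:
        return False
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return True
```

A communicating process could call `bind_near_interface("eth0")` before allocating its buffers, where the interface name is only an example. The sketch binds CPUs only; memory placement is left to the operating system's default allocation policy, which may or may not be sufficient on a given machine.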