Mapping applications for high performance on multithreaded, NUMA systems

Authors:
Guojing Cong;Huifang Wen
Affiliations:
IBM TJ Watson Research Center, Yorktown Heights, NY;IBM TJ Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the ACM International Conference on Computing Frontiers
Year:
2013

Abstract

On modern shared-memory systems, the communication latency and available resources for a group of logical processors are determined by their relative positions in the hierarchy of chips, cores, and threads. Multithreaded applications therefore perform differently depending on how software threads are mapped to logical processors. We observe that the execution time under one mapping can be 5.4 times that under another. Applications with irregular access patterns show the worst performance under the default OS mapping. When the logical processors span multiple chips, mapping alone does not reduce remote accesses on NUMA machines. We present new data replication and distribution optimizations for two irregular applications. We further show that locality optimization simultaneously reduces remote accesses and improves cache performance, and that it achieves better performance than prior NUMA-specific techniques.
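
To make the notion of a thread-to-processor mapping concrete, the following is a minimal sketch rather than the method used in the paper. It assumes a uniform hierarchy of chips, cores, and hardware threads, and the function names (`logical_processors`, `compact_mapping`, `scatter_mapping`, `chips_spanned`) are hypothetical, introduced only for illustration. It contrasts a mapping that packs software threads onto one chip with one that spreads them across chips.

```python
from itertools import product


def logical_processors(chips, cores_per_chip, threads_per_core):
    """Enumerate logical processors as (chip, core, hw_thread) tuples.

    Assumes a uniform hierarchy; real machines may be irregular.
    """
    return list(product(range(chips),
                        range(cores_per_chip),
                        range(threads_per_core)))


def compact_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Fill every hardware thread of a core, then the next core, then the next chip."""
    cpus = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(cpus):
        raise ValueError("more software threads than logical processors")
    return cpus[:n_threads]


def scatter_mapping(n_threads, chips, cores_per_chip, threads_per_core):
    """Spread software threads across chips first, then cores, then hardware threads."""
    cpus = logical_processors(chips, cores_per_chip, threads_per_core)
    if n_threads > len(cpus):
        raise ValueError("more software threads than logical processors")
    # Order so that the chip index varies fastest and the hardware-thread index slowest.
    ordered = sorted(cpus, key=lambda p: (p[2], p[1], p[0]))
    return ordered[:n_threads]


def chips_spanned(mapping):
    """Number of distinct chips touched by a mapping."""
    return len({chip for chip, _core, _hw in mapping})
```

On a hypothetical machine with 2 chips, 4 cores per chip, and 2 hardware threads per core, `compact_mapping(4, 2, 4, 2)` keeps four software threads on a single chip, where they share that chip's cores and caches, whereas `scatter_mapping(4, 2, 4, 2)` places them on both chips, so accesses to data homed on the other chip become remote. Which choice performs better depends on the application; the sketch only shows how the two placements differ.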