Nonuniform memory affinity strategy in multithreaded sparse matrix computations

Authors:
Avinash Srinivasa; Masha Sosonkina
Affiliations:
Iowa State University, Ames, IA; Iowa State University, Ames, IA
Venue:
Proceedings of the 2012 Symposium on High Performance Computing
Year:
2012

Abstract

As core counts on modern multiprocessor systems increase, so does memory contention, because all processes and threads try to access main memory simultaneously. This is typical of Uniform Memory Access (UMA) architectures, in which a single physical memory bank leads to poor scalability of multithreaded applications. To alleviate this problem, modern systems are increasingly moving towards Nonuniform Memory Access (NUMA) architectures, in which the physical memory is split into several (typically two or four) banks. Each memory bank is associated with a set of cores, enabling threads to operate from their own physical memory banks while retaining the concept of a shared virtual address space. However, accessing shared data structures in remote memory banks may become increasingly slow. This paper proposes a way to determine and pin certain parts of the shared data to specific memory banks, thus minimizing remote accesses. To achieve this, existing application code may be supplied with the proposed interface to set up and distribute shared data appropriately among memory banks. Experiments were performed with the NAS CG benchmark as well as with a realistic large-scale application that calculates ab initio nuclear structure. Speedups of up to 3.5 times were observed with the proposed approach compared with the default memory placement policy.
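The abstract does not describe the proposed interface, so the following is only an illustrative sketch of the general idea of deciding which part of a shared sparse matrix belongs on which memory bank. It assumes a matrix in CSR form and splits its rows into contiguous blocks with roughly equal numbers of nonzeros, one block per bank. The function names `plan_row_blocks` and `bank_of_thread` and their parameters are hypothetical and are not taken from the paper. The sketch only computes the placement plan; actually binding pages to a bank would be done separately, for example with libnuma's `numa_alloc_onnode` or a first-touch initialization on Linux, which is not shown here.

```python
from bisect import bisect_left


def plan_row_blocks(row_ptr, num_banks):
    """Split CSR rows into contiguous blocks, one per memory bank,
    so that each block holds roughly the same number of nonzeros.

    row_ptr   -- CSR row pointer array (length n_rows + 1, row_ptr[0] == 0)
    num_banks -- number of NUMA memory banks (hypothetical parameter)

    Returns a list of half-open (first_row, end_row) ranges, one per bank.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for bank in range(1, num_banks):
        target = total_nnz * bank / num_banks
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        cut = max(bounds[-1], min(cut, n_rows))
        bounds.append(cut)
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(num_banks)]


def bank_of_thread(thread_id, threads_per_bank):
    """Map a thread to the bank whose cores it is assumed to run on.

    Assumes threads are bound to cores in order, bank by bank.
    """
    return thread_id // threads_per_bank


if __name__ == "__main__":
    # Toy matrix with 6 rows holding 4, 1, 1, 2, 2, 2 nonzeros.
    row_ptr = [0, 4, 5, 6, 8, 10, 12]
    for bank, (lo, hi) in enumerate(plan_row_blocks(row_ptr, 2)):
        nnz = row_ptr[hi] - row_ptr[lo]
        print(f"bank {bank}: rows [{lo}, {hi}) with {nnz} nonzeros")
```

For the toy matrix, the plan assigns rows 0 to 2 to bank 0 and rows 3 to 5 to bank 1, with six nonzeros each. Balancing by nonzeros rather than by row count is a design choice made for this sketch, on the assumption that the work and memory traffic of a sparse matrix-vector product scale with the number of nonzeros a thread touches.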