Impact of sharing-based thread placement on multithreaded architectures
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Operating system support for improving data locality on CC-NUMA compute servers
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Data distribution support on distributed shared memory multiprocessors
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
A case for user-level dynamic page migration
Proceedings of the 14th international conference on Supercomputing
Is data distribution necessary in OpenMP?
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors
ICPP '97 Proceedings of the international Conference on Parallel Processing
Generalized multipartitioning of multi-dimensional arrays for parallelizing line-sweep computations
Journal of Parallel and Distributed Computing - Special section best papers from the 2002 international parallel and distributed processing symposium
Using Hardware Counters to Automatically Improve Memory Performance
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Programming for parallelism and locality with hierarchically tiled arrays
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Locality and Loop Scheduling on NUMA Multiprocessors
ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 02
Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors
Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007
What can performance counters do for memory subsystem analysis?
Proceedings of the 2008 ACM SIGPLAN workshop on Memory systems performance and correctness: held in conjunction with the Thirteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08)
Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Cache topology aware computation mapping for multicores
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Feedback-directed page placement for ccNUMA via hardware-generated memory traces
Journal of Parallel and Distributed Computing
Proceedings of the international symposium on Memory management
A case for NUMA-aware contention management on multicore systems
USENIXATC'11 Proceedings of the 2011 USENIX conference on USENIX annual technical conference
A template library to integrate thread scheduling and locality management for NUMA multiprocessors
HotPar'12 Proceedings of the 4th USENIX conference on Hot Topics in Parallelism
Actor scheduling for multicore hierarchical memory platforms
Proceedings of the twelfth ACM SIGPLAN workshop on Erlang
A flexible and dynamic page migration infrastructure based on hardware counters
The Journal of Supercomputing
Maximizing the performance of irregular applications on multithreaded, NUMA systems
IA^3 '13 Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms
A tool to analyze the performance of multithreaded programs on NUMA architectures
Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
Improving execution unit occupancy on SMT-based processors through hardware-aware thread scheduling
Future Generation Computer Systems
Hi-index | 0.00 |
Many recent multicore multiprocessors are based on a nonuniform memory architecture (NUMA). A mismatch between the data access patterns of programs and the mapping of data to memory incurs a high overhead, as remote accesses have higher latency and lower throughput than local accesses. This paper reports on a limit study that shows that many scientific loop-parallel programs include multiple, mutually incompatible data access patterns, therefore these programs encounter a high fraction of costly remote memory accesses. Matching the data distribution of a program to the individual data access patterns is possible, however it is difficult to find a data distribution that matches all access patterns. Directives as included in, e.g., OpenMP provide a way to distribute the computation, but the induced data partitioning does not take into account the placement of data into the processors' memory. To alleviate this problem we describe a small set of language-level primitives for memory allocation and loop scheduling. Using the primitives together with simple program-level transformations eliminates mutually incompatible access patterns from OpenMP-style parallel programs. This result represents an improvement of up to 3.3X over the default setup, and the programs obtain a speedup of up to 33.6X over single-core execution (19X on average) on a 4-processor 32-core machine.