Cache-coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates each page near the processors that access it most frequently. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to the cost (e.g., time to trace, number of records in the profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time.
We show that long-latency loads provide a better indicator for page placement than TLB misses. Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% of the original program's wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
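The core affinity decision the abstract describes — binding each page to the node whose processors access it most often — can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the function name `decide_page_affinity` and the sample format are hypothetical, and a real deployment would feed in hardware-counter samples (e.g., long-latency loads) and then bind pages via an OS facility such as Linux's `move_pages`.

```python
from collections import Counter, defaultdict

def decide_page_affinity(samples):
    """Given profile samples as (page_address, accessing_node) pairs,
    return a mapping from each page to the node that accessed it most
    often -- the node the page should be bound to.

    Hypothetical sketch of the affinity-decision step only; sampling
    and the actual page binding are outside this function.
    """
    counts = defaultdict(Counter)
    for page, node in samples:
        counts[page][node] += 1
    # For each page, pick the node with the highest sample count.
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}

# Example: page 0x1000 is touched mostly from node 2, page 0x2000 from node 1.
samples = [(0x1000, 2), (0x1000, 2), (0x1000, 0), (0x2000, 1)]
affinity = decide_page_affinity(samples)
# affinity == {0x1000: 2, 0x2000: 1}
```

Because the hardware profile is only an approximate trace, a majority count like this tolerates sampling noise: a page need not have every access recorded, only enough samples for the dominant node to emerge.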