Hardware profile-guided automatic page placement for ccNUMA systems

  • Authors:
  • Jaydeep Marathe, Frank Mueller

  • Affiliation:
  • North Carolina State University, Raleigh, NC

  • Venue:
  • Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
  • Year:
  • 2006

Abstract

Cache coherent non-uniform memory architectures (ccNUMA) constitute an important class of high-performance computing platforms. Contemporary ccNUMA systems, such as the SGI Altix, have a large number of nodes, where each node consists of a small number of processors and a fixed amount of physical memory. All processors in the system access the same global virtual address space, but the physical memory is distributed across nodes, and coherence is maintained using hardware mechanisms. Accesses to local physical memory (on the same node as the requesting processor) result in lower latencies than accesses to remote memory (on a different node). Since many scientific programs are memory-bound, an intelligent page-placement policy that allocates pages closer to the requesting processor can significantly reduce the number of cycles required to access memory. We show that such a policy can lead to significant savings in wall-clock execution time.

In this paper, we introduce a novel hardware-assisted page placement scheme based on automated profiling. The placement scheme allocates each page near the processor that accesses it most frequently. The scheme leverages the performance monitoring capabilities of contemporary microprocessors to efficiently extract an approximate trace of memory accesses. This information is used to decide page affinity, i.e., the node to which the page is bound. Our method operates entirely in user space, is largely automated, and handles not only static but also dynamic memory allocation.

We evaluate our framework with a set of multi-threaded benchmarks from the NAS and SPEC OpenMP suites. We investigate the use of two different hardware profile sources with respect to the cost (e.g., time to trace, number of records in the profile) vs. the accuracy of the profile and the corresponding savings in wall-clock execution time. We show that long-latency loads provide a better indicator for page placement than TLB misses.

Our experiments show that our method can efficiently improve page placement, leading to an average wall-clock execution time saving of more than 20% for our benchmarks, with a one-time profiling overhead of 2.7% over the original program's overall wall-clock time. To the best of our knowledge, this is the first evaluation on a real machine of a completely user-mode, interrupt-driven, profile-guided page placement scheme that requires no special compiler, operating system, or network interconnect support.
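
The abstract does not describe the implementation, but as a rough illustration of the final placement step, the following is a minimal user-space sketch. It assumes Linux with libnuma; the per-page access histogram, the helper name place_pages, and its data layout are hypothetical and not taken from the paper. Given profiled per-node access counts for each page (in the paper these would be derived from hardware samples such as long-latency loads, which is not shown here), it binds each page to the node that accessed it most often via the move_pages(2) system call.

    /* Minimal sketch: migrate pages toward their dominant accessor node.
     * Assumptions: Linux, libnuma (compile with -lnuma), 4 KiB pages, and a
     * hypothetical access histogram counts[i * nnodes + n] = number of
     * profiled accesses to page i from node n. Not the paper's code. */
    #include <numaif.h>     /* move_pages, MPOL_MF_MOVE */
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE 4096

    static void place_pages(void *base, size_t npages,
                            const unsigned long *counts, int nnodes)
    {
        void **pages  = malloc(npages * sizeof(*pages));
        int   *nodes  = malloc(npages * sizeof(*nodes));
        int   *status = malloc(npages * sizeof(*status));

        for (size_t i = 0; i < npages; i++) {
            /* Pick the node with the highest profiled access count. */
            int best = 0;
            for (int n = 1; n < nnodes; n++)
                if (counts[i * nnodes + n] > counts[i * nnodes + best])
                    best = n;
            pages[i] = (char *)base + i * PAGE_SIZE;
            nodes[i] = best;
        }

        /* pid 0 = calling process; MPOL_MF_MOVE migrates only pages
         * exclusively owned by this process. */
        if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) != 0)
            perror("move_pages");

        free(pages);
        free(nodes);
        free(status);
    }

A user-level migration call of this kind matches the spirit of the paper's claim that the scheme needs no special compiler, operating system, or interconnect support, since page binding can be requested entirely from user space.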