NUMA-aware memory manager with dominant-thread-based copying GC

Authors:
Takeshi Ogasawara
Affiliations:
IBM Research - Tokyo, Yamato, Japan
Venue:
Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications
Year:
2009



Abstract

We propose a novel online method for identifying the preferred NUMA nodes for objects, with negligible overhead at both garbage collection time and object allocation time. As the number of CPUs (or NUMA nodes) continues to increase, it is critical for the memory manager of an object-oriented language runtime to exploit the low latency of local memory for high performance. To locate the CPU of a thread that frequently accesses an object, prior research uses runtime information about memory accesses sampled by the hardware. However, the overhead of this approach is high for a garbage collector. Our approach instead uses information about which thread can exclusively access an object, which we call the Dominant Thread (DoT). The dominant thread of an object is the thread that accesses the object most often, so we do not require memory access samples. Our NUMA-aware GC performs DoT-based object copying, which copies each live object to the CPU where its dominant thread was last dispatched before GC. The dominant thread information is obtained from the thread stacks and from objects that are locked or reserved by threads, and it is propagated through the object reference graph. We demonstrate that our approach can improve the performance of benchmark programs such as SPECpower_ssj2008, SPECjbb2005, and SPECjvm2008. We prototyped a NUMA-aware memory manager on a modified version of the IBM Java VM and tested it on a cc-NUMA POWER6 machine with eight NUMA nodes. Our NUMA-aware GC achieved performance improvements of up to 14.3%, and 2.0% on average, over a JVM that used only the NUMA-aware allocator. The total improvement from using both the NUMA-aware allocator and GC is up to 53.1%, and 10.8% on average.
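
The propagation of dominant-thread information through the object reference graph can be illustrated with a small sketch. This is not the paper's implementation: the class and method names (DotSketch, Obj, Th, chooseNodes) are hypothetical, and the tie-breaking rules are assumptions made for illustration only. In particular, the sketch assumes that an object reached from several threads keeps the first dominant thread assigned to it, and that a lock or reservation owner overrides the thread inherited from a referring object. The abstract does not specify either rule.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DotSketch {
    /** Hypothetical heap object: outgoing references plus an optional lock owner. */
    static final class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();
        Integer lockOwner; // id of the thread that locked or reserved it, or null
        Obj(String name) { this.name = name; }
    }

    /** Hypothetical thread record: stack roots and the NUMA node it last ran on. */
    static final class Th {
        final int id;
        final int lastNode;
        final List<Obj> stackRoots = new ArrayList<>();
        Th(int id, int lastNode) { this.id = id; this.lastNode = lastNode; }
    }

    /** Assumption: a lock/reservation owner wins over an inherited thread. */
    private static int ownerOf(Obj o, int inherited) {
        return o.lockOwner != null ? o.lockOwner : inherited;
    }

    /** Maps each object reachable from a thread stack to a target NUMA node. */
    static Map<Obj, Integer> chooseNodes(List<Th> threads) {
        Map<Integer, Integer> nodeOfThread = new HashMap<>();
        for (Th t : threads) nodeOfThread.put(t.id, t.lastNode);

        Map<Obj, Integer> dominant = new HashMap<>(); // object -> dominant thread id
        Deque<Obj> work = new ArrayDeque<>();

        // Seed the traversal with objects referenced from each thread's stack.
        for (Th t : threads) {
            for (Obj root : t.stackRoots) {
                if (dominant.putIfAbsent(root, ownerOf(root, t.id)) == null) work.add(root);
            }
        }

        // Propagate along references; the first assignment is kept (assumption).
        while (!work.isEmpty()) {
            Obj o = work.poll();
            int dot = dominant.get(o);
            for (Obj child : o.refs) {
                if (dominant.putIfAbsent(child, ownerOf(child, dot)) == null) work.add(child);
            }
        }

        // Target node = node where the dominant thread was last dispatched
        // (null if the owning thread is not in the supplied list).
        Map<Obj, Integer> target = new HashMap<>();
        for (Map.Entry<Obj, Integer> e : dominant.entrySet()) {
            target.put(e.getKey(), nodeOfThread.get(e.getValue()));
        }
        return target;
    }
}
```

A copying collector built along these lines would then copy each live object into a region backed by memory on its target node. How objects with no dominant thread are placed is outside the scope of this sketch.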