A new approach to parallelising tracing algorithms

Authors:
Cosmin E. Oancea;Alan Mycroft;Stephen M. Watt
Affiliations:
The University of Cambridge, Cambridge, United Kingdom;The University of Cambridge, Cambridge, United Kingdom;The University of Western Ontario, London, ON, Canada
Venue:
Proceedings of the 2009 international symposium on Memory management
Year:
2009

Abstract

Tracing algorithms visit the reachable nodes of a graph and are central to activities such as garbage collection and marshaling. Traditional sequential algorithms use a worklist, repeatedly replacing a node with its unvisited children. Previous work on parallel tracing is processor-oriented, associating one worklist with each processor: worklist insertion and removal require no locking, and load balancing requires only occasional locking. However, since multiple worklists may contain the same node, significant locking is necessary to avoid concurrent visits by competing processors. This paper presents a memory-oriented solution: memory is partitioned into segments, and each segment has its own worklist containing only nodes in that segment. At any given time, at most one processor owns a given worklist. By arranging separate single-reader, single-writer forwarding queues to pass nodes from processor i to processor j, we can process objects in an order that gives lock-free mainline code and improved locality of reference. This refactoring is analogous to the way in which a compiler changes an iteration space to eliminate data dependencies. While it is clear that our solution can be more effective on NUMA systems, and is even necessary when processor-local memory cannot be addressed from other processors, it is slightly surprising that it often gives significantly better speed-up on modern multi-core architectures too. Using caches to hide memory latency loses much of its effectiveness when there is significant cross-processor memory contention or when locking is necessary.
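
The two schemes contrasted in the abstract can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions, not the authors' implementation: nodes are integer addresses, `SEGMENT_SIZE` and `NUM_PROCS` are made-up constants, segment ownership is fixed by a hypothetical `owner_of` function (the abstract only requires that each worklist has at most one owner at a time), and processors are simulated by round-robin turns rather than real threads. The function names are illustrative.

```python
from collections import defaultdict, deque

SEGMENT_SIZE = 4  # hypothetical: number of addresses per memory segment
NUM_PROCS = 2     # hypothetical: number of simulated processors


def sequential_trace(graph, roots):
    """Classic worklist tracing: replace a node with its unvisited children."""
    visited = set(roots)
    worklist = list(roots)
    while worklist:
        node = worklist.pop()
        for child in graph.get(node, ()):
            if child not in visited:
                visited.add(child)
                worklist.append(child)
    return visited


def segment_of(addr):
    return addr // SEGMENT_SIZE


def owner_of(segment):
    # Assumption: static ownership, chosen only to keep the sketch short.
    return segment % NUM_PROCS


def memory_oriented_trace(graph, roots):
    """Simulate tracing with one worklist per memory segment.

    Returns the visited set and a log of (processor, node) visits.
    """
    worklists = defaultdict(deque)  # segment -> nodes located in that segment
    # forward[i][j] is written only by processor i and read only by j.
    forward = [[deque() for _ in range(NUM_PROCS)] for _ in range(NUM_PROCS)]
    visited = set()  # a node's mark is only touched by its segment's owner
    visit_log = []

    for root in roots:
        worklists[segment_of(root)].append(root)

    progress = True
    while progress:
        progress = False
        for p in range(NUM_PROCS):  # one round-robin turn per processor
            # Drain the queues that other processors filled for p.
            for src in range(NUM_PROCS):
                inbox = forward[src][p]
                while inbox:
                    node = inbox.popleft()
                    worklists[segment_of(node)].append(node)
                    progress = True
            # Process the worklists of the segments that p owns.
            owned = [s for s in list(worklists) if owner_of(s) == p]
            for seg in owned:
                worklist = worklists[seg]
                while worklist:
                    node = worklist.popleft()
                    progress = True
                    if node in visited:
                        continue
                    visited.add(node)
                    visit_log.append((p, node))
                    for child in graph.get(node, ()):
                        dst = owner_of(segment_of(child))
                        if dst == p:
                            worklists[segment_of(child)].append(child)
                        else:
                            forward[p][dst].append(child)
    return visited, visit_log


# Example graph: address -> addresses of children; node 7 is unreachable.
graph = {0: [1, 5], 1: [2, 9], 2: [0], 5: [9, 13], 9: [5], 13: [], 7: [0]}
visited, visit_log = memory_oriented_trace(graph, roots=[0])
```

In this sketch a node is only ever marked by the processor that owns its segment, which is why the visited check needs no lock in a real multi-threaded setting; a child living in another processor's segment is handed over through the forwarding queue rather than visited directly.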