Cell GC: using the cell synergistic processor as a garbage collection coprocessor

Authors:
Chen-Yong Cher; Michael Gschwind
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Year:
2008

Abstract

In recent years, scaling of single-core superscalar processor performance has slowed due to complexity and power considerations. To improve program performance, designs are increasingly adopting chip multiprocessing with homogeneous or heterogeneous CMPs. By trading off features of a modern aggressive superscalar core, CMPs often offer better scaling characteristics in terms of aggregate performance, complexity and power, but often require additional software investment to rewrite, retune or recompile programs to take advantage of the new designs. The Cell Broadband Engine is a modern example of a heterogeneous CMP with coprocessors (accelerators); it can be found in supercomputers (Roadrunner), blade servers (IBM QS20/21), and video game consoles (SCEI PS3). A Cell BE processor has a host Power RISC processor (PPE) and eight Synergistic Processor Elements (SPEs), each consisting of a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). In this work, we explore the idea of offloading Automatic Dynamic Garbage Collection (GC) from the host processor onto accelerator processors using the coprocessor paradigm. Offloading part or all of GC to a coprocessor offers potential performance benefits, because while the coprocessor is running GC, the host processor can continue running other independent, more general computations. We implement BDW garbage collection on a Cell system and offload the mark phase to the SPE coprocessor. We show mark phase execution on the SPE accelerator to be competitive with execution on a full-fledged PPE processor. We also explore object-based and block-based caching strategies for explicitly managed memory hierarchies, and explore the effectiveness of several prefetching schemes in the context of garbage collection. Finally, we implement Capitulative Loads using the DMA by extending software caches, and quantify their performance impact on the coprocessor.
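
To make the idea of a mark phase running behind a block-based software cache more concrete, the following is a minimal sketch in C. It is not the authors' implementation and does not use the BDW collector or the Cell SDK. The object layout (`Obj`), the cache geometry (`BLOCK_OBJS`, `CACHE_LINES`), and every function name are hypothetical, and a plain `memcpy` stands in for the DMA transfer that would move a block from host memory into SPE local store. As a further simplification, the mark bits are kept directly accessible rather than being fetched through the cache.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names and sizes below are illustrative assumptions. */
#define HEAP_OBJS   64    /* objects in the simulated heap             */
#define NO_CHILD    (-1)  /* marks an empty pointer slot               */
#define BLOCK_OBJS  4     /* objects transferred per cache block       */
#define CACHE_LINES 4     /* direct-mapped lines in the "local store"  */

typedef struct { int32_t child[2]; } Obj;   /* up to two outgoing references */

Obj     main_memory[HEAP_OBJS];  /* stands in for host (PPE-side) memory */
uint8_t mark_bits[HEAP_OBJS];    /* simplification: accessed directly    */

typedef struct { int tag; Obj data[BLOCK_OBJS]; } CacheLine;
CacheLine     cache[CACHE_LINES];
unsigned long cache_misses;

void heap_reset(void) {
    for (int i = 0; i < HEAP_OBJS; i++) {
        main_memory[i].child[0] = main_memory[i].child[1] = NO_CHILD;
        mark_bits[i] = 0;
    }
}

void set_children(int idx, int a, int b) {
    main_memory[idx].child[0] = a;
    main_memory[idx].child[1] = b;
}

void cache_reset(void) {
    for (int i = 0; i < CACHE_LINES; i++) cache[i].tag = -1;
    cache_misses = 0;
}

/* Load an object through the block-based software cache.
 * On a miss the whole block is copied in; memcpy stands in for a DMA get. */
const Obj *cached_load(int idx) {
    int block = idx / BLOCK_OBJS;
    CacheLine *line = &cache[block % CACHE_LINES];
    if (line->tag != block) {
        memcpy(line->data, &main_memory[block * BLOCK_OBJS], sizeof line->data);
        line->tag = block;
        cache_misses++;
    }
    return &line->data[idx % BLOCK_OBJS];
}

/* Mark every object reachable from root using an explicit mark stack.
 * Objects are marked when pushed, so each is pushed at most once.
 * Returns the number of objects marked. */
int mark_from(int root) {
    int stack[HEAP_OBJS];
    int top = 0, marked = 0;

    mark_bits[root] = 1;
    stack[top++] = root;
    while (top > 0) {
        int idx = stack[--top];
        const Obj *o = cached_load(idx);  /* valid until the next cached_load */
        marked++;
        for (int k = 0; k < 2; k++) {
            int c = o->child[k];
            if (c != NO_CHILD && !mark_bits[c]) {
                mark_bits[c] = 1;
                assert(top < HEAP_OBJS);  /* cannot overflow: one push per object */
                stack[top++] = c;
            }
        }
    }
    return marked;
}
```

A caller would build a small object graph with `heap_reset` and `set_children`, call `cache_reset`, and then invoke `mark_from` on a root index. Comparing `cache_misses` across different values of `BLOCK_OBJS` gives a rough feel for how block size interacts with object locality, which is the kind of trade-off between block-based and object-based caching that the abstract refers to. A real coprocessor version would also need asynchronous transfers and a way to hand results back to the host, both of which this sketch omits.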