Space efficient conservative garbage collection
PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Compiler-based prefetching for recursive data structures
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Informing memory operations: memory performance feedback mechanisms and their applications
ACM Transactions on Computer Systems (TOCS)
Reducing garbage collector cache misses
Proceedings of the 2nd international symposium on Memory management
Concurrent garbage collection using hardware-assisted profiling
Proceedings of the 2nd international symposium on Memory management
Concurrent garbage collection using program slices on multithreaded processors
Proceedings of the 2nd international symposium on Memory management
Software prefetching for mark-sweep garbage collection: hardware analysis and software redesign
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Optimizing Compiler for the CELL Processor
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
The potential of the cell processor for scientific computing
Proceedings of the 3rd conference on Computing frontiers
Introduction to the cell multiprocessor
IBM Journal of Research and Development - POWER5 and packaging
Effective prefetch for mark-sweep garbage collection
Proceedings of the 6th international symposium on Memory management
The cell broadband engine: exploiting multiple levels of parallelism in a chip multiprocessor
International Journal of Parallel Programming
Memory management for many-core processors with software configurable locality policies
Proceedings of the 2012 international symposium on Memory Management
GPUs as an opportunity for offloading garbage collection
Proceedings of the 2012 international symposium on Memory Management
Hi-index | 0.00 |
In recent years, scaling of single-core superscalar processor performance has slowed due to complexity and power considerations. To improve program performance, designs are increasingly adopting chip multiprocessing with homogeneous or heterogeneous CMPs. By trading off features from a modern aggressive superscalar core, CMPs often offer better scaling characteristics in terms of aggregate performance, complexity and power, but often require additional software investment to rewrite, retune or recompile programs to take advantage of the new designs. The Cell Broadband Engine is a modern example of a heterogeneous CMP with coprocessors (accelerators) which can be found in supercomputers (Roadrunner), blade servers (IBM QS20/21), and video game consoles (SCEI PS3). A Cell BE processor has a host Power RISC processor (PPE) and eight Synergistic Processor Elements (SPE), each consisting of a Synergistic Processor Unit (SPU) and Memory Flow Controller (MFC). In this work, we explore the idea of offloading Automatic Dynamic Garbage Collection (GC) from the host processor onto accelerator processors using the coprocessor paradigm. Offloading part or all of GC to a coprocessor offers potential performance benefits, because while the coprocessor is running GC, the host processor can continue running other independent, more general computations. . We implement BDW garbage collection on a Cell system and offload the mark phase to the SPE co-processor. We show mark phase execution on the SPE accelerator to be competitive with execution on a full fledged PPE processor. We also explore object-based and block-based caching strategies for explicitly managed memory hierarchies, and explore to effectiveness of several prefetching schemes in the context of garbage collection. Finally, we implement Capitulative Loads using the DMA by extending software caches and quantify its performance impact on the coprocessor.