Multicore designers often add a small local memory close to each core to speed up access and to reduce off-chip I/O. This approach, however, places a burden on the programmer, the compiler, and the runtime system: the local memory lacks hardware support (cache logic, an MMU, etc.) and must therefore be managed in software to exploit its performance potential. The IBM Cell Broadband Engine (Cell B.E.) is extreme in this respect, since each of its parallel cores can directly address code and data only in its own local memory. Overlay techniques from the 1970s solve this problem, but with well-known drawbacks: the programmer must manually divide the program into overlays, and the largest overlay determines how much data the application can work with. With our approach, programmers no longer need to cut overlays. Instead, we automatically fragment the program at runtime and load small code snippets into a code cache that resides in the local stores and is supervised by a garbage collector. Since our loader does not load code that is not needed for execution, the code cache can be much smaller (up to 70%) than the original program size; applications can therefore work on larger data sets, i.e., bigger problems. Our loader is highly efficient and slows down applications by less than 5% on average, and it can load any native code without pre-processing or changes in the software tool chain.