Thread-Shared Software Code Caches

Authors:
Derek Bruening;Vladimir Kiriansky;Timothy Garnett;Sanjeev Banerji
Affiliations:
Determina, Inc.;Determina, Inc.;Determina, Inc.;Determina, Inc.
Venue:
Proceedings of the International Symposium on Code Generation and Optimization
Year:
2006

Citing 26
Cited 13

Introduction to algorithms

Introduction to algorithms
Shade: a fast instruction-set simulator for execution profiling

SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Embra: fast and flexible machine simulation

Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DAISY: dynamic compilation for 100% architectural compatibility

Proceedings of the 24th annual international symposium on Computer architecture
Disco: running commodity operating systems on scalable multiprocessors

Proceedings of the sixteenth ACM symposium on Operating systems principles
Fast, effective code generation in a just-in-time Java compiler

PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Execution characteristics of desktop applications on Windows NT

Proceedings of the 25th annual international symposium on Computer architecture
Dynamo: a transparent dynamic optimization system

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
A framework for interprocedural optimization in the presence of dynamic class loading

PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Adaptive optimization in the Jalapeño JVM

OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Software profiling for hot path prediction: less is more

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
SEDA: an architecture for well-conditioned, scalable internet services

SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
PA-RISC to IA-64: Transparent Execution, No Recompilation

Computer
FX!32: A Profile-Directed Binary Translator

IEEE Micro
Using Cohort-Scheduling to Enhance Server Performance

ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
DELI: a new run-time control point

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
The Transmeta Code Morphing™ Software: using speculation, recovery, and adaptive retranslation to address real-life challenges

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Retargetable and reconfigurable software dynamic translation

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Efficient implementation of the smalltalk-80 system

POPL '84 Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Code Cache Management Schemes for Dynamic Optimizers

INTERACT '02 Proceedings of the Sixth Annual Workshop on Interaction between Compilers and Computer Architectures
Adaptive optimization for self: reconciling high performance with exploratory programming

Adaptive optimization for self: reconciling high performance with exploratory programming
The Jalapeño virtual machine

IBM Systems Journal
Maintaining Consistency and Bounding Capacity of Software Code Caches

Proceedings of the international symposium on Code generation and optimization
Pin: building customized program analysis tools with dynamic instrumentation

Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Efficient, transparent, and comprehensive runtime code manipulation

Efficient, transparent, and comprehensive runtime code manipulation
SimICS/sun4m: a virtual workstation

ATEC '98 Proceedings of the annual conference on USENIX Annual Technical Conference

Fast and efficient partial code reordering: taking advantage of dynamic recompilatior

Proceedings of the 5th international symposium on Memory management
HDTrans: an open source, low-level dynamic instrumentation system

Proceedings of the 2nd international conference on Virtual execution environments
Fragment cache management for dynamic binary translators in embedded systems with scratchpad

CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Scalable support for multithreaded applications on dynamic binary instrumentation systems

Proceedings of the 2009 international symposium on Memory management
DBT path selection for holistic memory efficiency and performance

Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Generating low-overhead dynamic binary translators

Proceedings of the 3rd Annual Haifa Experimental Systems Conference
Balancing memory and performance through selective flushing of software code caches

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
A design for comprehensive kernel instrumentation

HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Comprehensive kernel instrumentation via dynamic binary translation

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Aikido: accelerating shared data dynamic analyses

ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Efficiently parallelizing instruction set simulation of embedded multi-core processors using region-based just-in-time dynamic binary translation

Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Memory optimization of dynamic binary translators for embedded systems

ACM Transactions on Architecture and Code Optimization (TACO)
Enabling dynamic binary translation in embedded systems with scratchpad memory

ACM Transactions on Embedded Computing Systems (TECS)

Quantified Score

Hi-index	0.00

Visualization

Abstract

Software code caches are increasingly being used to amortize the runtime overhead of dynamic optimizers, simulators, emulators, dynamic translators, dynamic compilers, and other tools. Despite the now-widespread use of code caches, techniques for efficiently sharing them across multiple threads have not been fully explored. Some systems simply do not support threads, while others resort to thread-private code caches. Although thread-private caches are much simpler to manage, synchronize, and provide scratch space for, they simply do not scale when applied to many-threaded programs. Thread-shared code caches are needed to target server applications, which employ hundreds of worker threads all performing similar tasks. Yet, those systems that do share their code caches often have bruteforce, inefficient solutions to the challenges of concurrent code cache access: a single global lock on runtime system code and suspension of all threads for any cache management action. This limits the possibilities for cache design and has performance problems with applications that require frequent cache invalidations to maintain cache consistency. In this paper, we discuss the design choices when building thread-shared code caches and enumerate the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction. We present efficient solutions to these problems that both scale well and do not require thread suspension. We evaluate our results in DynamoRIO, an industrial-strength dynamic binary translation system, on real-world server applications. On these applications our thread-shared caches use an order of magnitude less memory and improve throughput by up to four times compared to threadprivate caches.