Introduction to algorithms
Shade: a fast instruction-set simulator for execution profiling
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Embra: fast and flexible machine simulation
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
DAISY: dynamic compilation for 100% architectural compatibility
Proceedings of the 24th annual international symposium on Computer architecture
Disco: running commodity operating systems on scalable multiprocessors
Proceedings of the sixteenth ACM symposium on Operating systems principles
Fast, effective code generation in a just-in-time Java compiler
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
Execution characteristics of desktop applications on Windows NT
Proceedings of the 25th annual international symposium on Computer architecture
Dynamo: a transparent dynamic optimization system
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
A framework for interprocedural optimization in the presence of dynamic class loading
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Adaptive optimization in the Jalapeño JVM
OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Software profiling for hot path prediction: less is more
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
SEDA: an architecture for well-conditioned, scalable internet services
SOSP '01 Proceedings of the eighteenth ACM symposium on Operating systems principles
FX!32: A Profile-Directed Binary Translator
IEEE Micro
Using Cohort-Scheduling to Enhance Server Performance
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
DELI: a new run-time control point
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Retargetable and reconfigurable software dynamic translation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Efficient implementation of the smalltalk-80 system
POPL '84 Proceedings of the 11th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Code Cache Management Schemes for Dynamic Optimizers
INTERACT '02 Proceedings of the Sixth Annual Workshop on Interaction between Compilers and Computer Architectures
Adaptive optimization for self: reconciling high performance with exploratory programming
Adaptive optimization for self: reconciling high performance with exploratory programming
IBM Systems Journal
Maintaining Consistency and Bounding Capacity of Software Code Caches
Proceedings of the international symposium on Code generation and optimization
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Efficient, transparent, and comprehensive runtime code manipulation
Efficient, transparent, and comprehensive runtime code manipulation
SimICS/sun4m: a virtual workstation
ATEC '98 Proceedings of the annual conference on USENIX Annual Technical Conference
Fast and efficient partial code reordering: taking advantage of dynamic recompilatior
Proceedings of the 5th international symposium on Memory management
HDTrans: an open source, low-level dynamic instrumentation system
Proceedings of the 2nd international conference on Virtual execution environments
Fragment cache management for dynamic binary translators in embedded systems with scratchpad
CASES '07 Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems
Scalable support for multithreaded applications on dynamic binary instrumentation systems
Proceedings of the 2009 international symposium on Memory management
DBT path selection for holistic memory efficiency and performance
Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Generating low-overhead dynamic binary translators
Proceedings of the 3rd Annual Haifa Experimental Systems Conference
Balancing memory and performance through selective flushing of software code caches
CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
A design for comprehensive kernel instrumentation
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Comprehensive kernel instrumentation via dynamic binary translation
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Aikido: accelerating shared data dynamic analyses
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Memory optimization of dynamic binary translators for embedded systems
ACM Transactions on Architecture and Code Optimization (TACO)
Enabling dynamic binary translation in embedded systems with scratchpad memory
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 0.00 |
Software code caches are increasingly being used to amortize the runtime overhead of dynamic optimizers, simulators, emulators, dynamic translators, dynamic compilers, and other tools. Despite the now-widespread use of code caches, techniques for efficiently sharing them across multiple threads have not been fully explored. Some systems simply do not support threads, while others resort to thread-private code caches. Although thread-private caches are much simpler to manage, synchronize, and provide scratch space for, they simply do not scale when applied to many-threaded programs. Thread-shared code caches are needed to target server applications, which employ hundreds of worker threads all performing similar tasks. Yet, those systems that do share their code caches often have bruteforce, inefficient solutions to the challenges of concurrent code cache access: a single global lock on runtime system code and suspension of all threads for any cache management action. This limits the possibilities for cache design and has performance problems with applications that require frequent cache invalidations to maintain cache consistency. In this paper, we discuss the design choices when building thread-shared code caches and enumerate the difficulties of thread-local storage, synchronization, trace building, in-cache lookup tables, and cache eviction. We present efficient solutions to these problems that both scale well and do not require thread suspension. We evaluate our results in DynamoRIO, an industrial-strength dynamic binary translation system, on real-world server applications. On these applications our thread-shared caches use an order of magnitude less memory and improve throughput by up to four times compared to threadprivate caches.