Hardware Support for Control Transfers in Code Caches

Authors:
Ho-Seop Kim;James E. Smith
Affiliations:
Department of Electrical and Computer Engineering, University of Wisconsin - Madison;Department of Electrical and Computer Engineering, University of Wisconsin - Madison
Venue:
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2003

Abstract

Many dynamic optimization and/or binary translation systems hold optimized/translated superblocks in a code cache. Conventional code caching systems suffer from overheads when control is transferred from one cached superblock to another, especially via register-indirect jumps. The basic problem is that instruction addresses in the code cache are different from those in the original program binary. Therefore, performance for register-indirect jumps depends on the ability to translate efficiently from source binary PC values to code cache PC values. We analyze several key aspects of superblock chaining and find that a conventional baseline code cache with software jump target prediction results in 14.6% IPC loss versus the original binary. We identify the inability to use a conventional return address stack as the most significant performance limiter in code cache systems. We introduce a modified software prediction technique that reduces the IPC loss to 11.4%. This technique is based on a technique used in threaded code interpreters. A number of hardware mechanisms, including a specialized return address stack and a hardware cache for translated jump target addresses, are studied for efficiently supporting register-indirect jumps. Once all the chaining overheads are removed by these support mechanisms, a superblock-based code cache improves performance due to a better branch prediction rate, improved I-cache locality, and increased chances of straight-line fetches. Simulation results show a 7.7% IPC improvement over a current-generation 4-way superscalar processor.
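
The address translation problem described in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation. It assumes a simple map from source binary PCs to code cache PCs and a single inlined compare at each register-indirect jump, which is one common way to picture software jump target prediction. The names `CodeCache`, `IndirectJumpSite`, `install`, `lookup`, and `resolve`, as well as the base address `0x8000_0000`, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CodeCache:
    """Toy model: maps source-binary PCs to code-cache PCs (hypothetical layout)."""
    translation: dict = field(default_factory=dict)  # source PC -> code cache PC
    next_free: int = 0x8000_0000                     # assumed base of the cache region

    def install(self, source_pc: int, size: int) -> int:
        """Place a translated superblock in the cache and record its entry point."""
        cache_pc = self.next_free
        self.translation[source_pc] = cache_pc
        self.next_free += size
        return cache_pc

    def lookup(self, source_pc: int):
        """Slow path: full map lookup; None means the target is not translated yet."""
        return self.translation.get(source_pc)


@dataclass
class IndirectJumpSite:
    """One register-indirect jump with a single inlined software prediction."""
    predicted_source_pc: int
    predicted_cache_pc: int
    hits: int = 0
    misses: int = 0

    def resolve(self, cache: CodeCache, target_source_pc: int):
        # Fast path: the register holds a source PC, so compare it with the
        # predicted source PC and reuse the already-translated cache address.
        if target_source_pc == self.predicted_source_pc:
            self.hits += 1
            return self.predicted_cache_pc
        # Slow path: fall back to the source-PC-to-cache-PC translation map.
        self.misses += 1
        return cache.lookup(target_source_pc)


cache = CodeCache()
block_a = cache.install(0x1000, 64)  # superblock for source PC 0x1000
block_b = cache.install(0x2000, 32)  # superblock for source PC 0x2000
site = IndirectJumpSite(predicted_source_pc=0x1000, predicted_cache_pc=block_a)

# Targets 0x1000 hit the inlined prediction; 0x2000 needs the map lookup;
# 0x3000 has no translation yet, so the lookup returns None.
results = [site.resolve(cache, pc) for pc in (0x1000, 0x2000, 0x1000, 0x3000)]
```

In this model every mispredicted target pays for the slow-path lookup, which loosely corresponds to the chaining overhead the abstract attributes to register-indirect jumps. A return instruction is a register-indirect jump whose target varies with the call site, so a single inlined prediction serves it poorly.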