Many dynamic optimization and/or binary translation systems hold optimized/translated superblocks in a code cache. Conventional code caching systems suffer from overheads when control is transferred from one cached superblock to another, especially via register-indirect jumps. The basic problem is that instruction addresses in the code cache are different from those in the original program binary. Therefore, performance for register-indirect jumps depends on the ability to translate efficiently from source binary PC values to code cache PC values.

We analyze several key aspects of superblock chaining and find that a conventional baseline code cache with software jump target prediction results in 14.6% IPC loss versus the original binary. We identify the inability to use a conventional return address stack as the most significant performance limiter in code cache systems. We introduce a modified software prediction technique that reduces the IPC loss to 11.4%. It is based on a technique used in threaded code interpreters.

A number of hardware mechanisms, including a specialized return address stack and a hardware cache for translated jump target addresses, are studied for efficiently supporting register-indirect jumps. Once all the chaining overheads are removed by these support mechanisms, a superblock-based code cache improves performance due to a better branch prediction rate, improved I-cache locality, and increased chances of straight-line fetches. Simulation results show a 7.7% IPC improvement over a current-generation 4-way superscalar processor.
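
To make the source-PC to code-cache-PC translation problem concrete, the following is a minimal Python sketch of software jump target prediction at a single register-indirect jump site. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the names `CodeCache`, `IndirectJumpSite`, and `resolve` are hypothetical, the translation map is modeled as a plain dictionary, and the policy of updating the prediction after a slow-path hit is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class CodeCache:
    """Hypothetical model of a code cache's address translation map."""
    # Maps a source-binary PC to the code-cache PC of its translated superblock.
    translation_map: Dict[int, int] = field(default_factory=dict)
    lookups: int = 0  # counts slow-path map lookups

    def install(self, source_pc: int, cache_pc: int) -> None:
        self.translation_map[source_pc] = cache_pc

    def lookup(self, source_pc: int) -> Optional[int]:
        # Slow path: a full map lookup. None means "not translated yet".
        self.lookups += 1
        return self.translation_map.get(source_pc)


@dataclass
class IndirectJumpSite:
    """One register-indirect jump inside a translated superblock."""
    predicted_source_pc: Optional[int] = None
    predicted_cache_pc: Optional[int] = None

    def resolve(self, cache: CodeCache, target_source_pc: int) -> Optional[int]:
        # Fast path: compare the runtime target (a source-binary PC held in a
        # register) against the predicted source PC embedded at this site.
        if target_source_pc == self.predicted_source_pc:
            return self.predicted_cache_pc
        # Slow path: translate through the map.
        cache_pc = cache.lookup(target_source_pc)
        if cache_pc is not None:
            # Assumption for this sketch: remember the most recent target.
            self.predicted_source_pc = target_source_pc
            self.predicted_cache_pc = cache_pc
        return cache_pc


cache = CodeCache()
cache.install(0x400100, 0x9000)
cache.install(0x400200, 0x9040)

site = IndirectJumpSite()
first = site.resolve(cache, 0x400100)   # slow path, then remembered
second = site.resolve(cache, 0x400100)  # fast path, no map lookup
third = site.resolve(cache, 0x400200)   # misprediction, slow path again
```

In this model every misprediction pays for a map lookup, which is the kind of chaining overhead the abstract attributes to register-indirect jumps. A subroutine return executed from many call sites would mispredict often under a single-target scheme like this one, which is consistent with the abstract's observation that losing the conventional return address stack is the most significant limiter.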