A hardware mechanism for dynamic extraction and relayout of program hot spots

Authors:
Matthew C. Merten;Andrew R. Trick;Erik M. Nystrom;Ronald D. Barnes;Wen-mei W. Hwu
Affiliations:
Coordinated Science Lab, 1308 West Main Street, MC-228 Urbana, IL;Coordinated Science Lab, 1308 West Main Street, MC-228 Urbana, IL;Coordinated Science Lab, 1308 West Main Street, MC-228 Urbana, IL;Coordinated Science Lab, 1308 West Main Street, MC-228 Urbana, IL;Coordinated Science Lab, 1308 West Main Street, MC-228 Urbana, IL
Venue:
Proceedings of the 27th annual international symposium on Computer architecture
Year:
2000

Abstract

This paper presents a new mechanism for collecting and deploying runtime-optimized code. The code collection component resides in the instruction retirement stage and lays out hot execution paths, both to improve the instruction fetch rate and to enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. Neither component adds significant delay to the program's total execution time. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches because its traces are longer and persist in main memory across context switches. These traces are also less susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that, on average, this mechanism can achieve better instruction fetch rates using only 12KB of hardware than a trace cache requiring 15KB of hardware, while producing long, persistent traces better suited to optimization.
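
The deployment idea in the abstract, redirecting execution through the Branch Target Buffer while leaving the original code untouched, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the entry layout, the class and method names (BTBEntry, ExtendedBTB, install_redirect, predict_target), and the addresses are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BTBEntry:
    """One Branch Target Buffer entry (hypothetical layout)."""
    original_target: int                 # target inside the unmodified original code
    relaid_target: Optional[int] = None  # assumed extension: target in the relaid-out copy


class ExtendedBTB:
    """Toy model of a BTB that can steer a branch into relaid-out code."""

    def __init__(self) -> None:
        self.entries: Dict[int, BTBEntry] = {}

    def record_branch(self, branch_pc: int, target: int) -> None:
        # Ordinary BTB behavior: remember where this branch goes.
        self.entries[branch_pc] = BTBEntry(original_target=target)

    def install_redirect(self, branch_pc: int, relaid_target: int) -> None:
        # Deployment step: attach the new target to an existing entry.
        # The original code and the original target are left as they were.
        entry = self.entries.get(branch_pc)
        if entry is not None:
            entry.relaid_target = relaid_target

    def predict_target(self, branch_pc: int) -> Optional[int]:
        # Prefer the relaid-out target when one is installed.
        entry = self.entries.get(branch_pc)
        if entry is None:
            return None
        if entry.relaid_target is not None:
            return entry.relaid_target
        return entry.original_target


btb = ExtendedBTB()
btb.record_branch(0x1000, 0x2000)
before = btb.predict_target(0x1000)   # original target
btb.install_redirect(0x1000, 0x9000)
after = btb.predict_target(0x1000)    # relaid-out target
```

In this model, dropping the redirect (setting relaid_target back to None) would return execution to the original code, which is one way to picture why the original code never needs to be modified.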