Evaluation of Design Options for the Trace Cache Fetch Mechanism
IEEE Transactions on Computers - Special issue on cache memory and related problems
A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Managing multi-configuration hardware via dynamic working set analysis
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Vacuum packing: extracting hardware-detected program phases for post-link optimization
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Dynamic binary translation for accumulator-oriented architectures
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Retargetable and reconfigurable software dynamic translation
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Proceedings of the 30th annual international symposium on Computer architecture
Exploiting program hotspots and code sequentiality for instruction cache leakage management
Proceedings of the 2003 international symposium on Low power electronics and design
RABIT: A New Framework for Runtime Emulation and Binary Translation
ANSS '04 Proceedings of the 37th annual symposium on Simulation
Method-level phase behavior in java workloads
OOPSLA '04 Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Effective Adaptive Computing Environment Management via Dynamic Optimization
Proceedings of the international symposium on Code generation and optimization
Proceedings of the 32nd annual international symposium on Computer Architecture
Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware
IEEE Transactions on Computers
Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining
IEEE Transactions on Computers
Selecting Software Phase Markers with Code Structure Analysis
Proceedings of the International Symposium on Code Generation and Optimization
Reducing Startup Time in Co-Designed Virtual Machines
Proceedings of the 33rd annual international symposium on Computer Architecture
Effective management of multiple configurable units using dynamic optimization
ACM Transactions on Architecture and Code Optimization (TACO)
A smart random code injection to mask power analysis based side channel attacks
CODES+ISSS '07 Proceedings of the 5th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Using hpm-sampling to drive dynamic compilation
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
PEAK—a fast and effective performance tuning system via compiler optimization orchestration
ACM Transactions on Programming Languages and Systems (TOPLAS)
A low-power phase change memory based hybrid cache architecture
Proceedings of the 18th ACM Great Lakes symposium on VLSI
Direct address translation for virtual memory in energy-efficient embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Cross-layer customization for rapid and low-cost task preemption in multitasked embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Detecting phases in parallel applications on shared memory architectures
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Trace-Based runtime instruction rescheduling for architecture extension
ICESS'05 Proceedings of the Second international conference on Embedded Software and Systems
Randomized Instruction Injection to Counter Power Analysis Attacks
ACM Transactions on Embedded Computing Systems (TECS)
Hi-index | 14.99 |
Wide-issue processors continue to achieve higher performance by exploiting greater instruction-level parallelism. Dynamic techniques such as out-of-order execution and hardware speculation have proven effective at increasing instruction throughput. Run-time optimization promises to provide an even higher level of performance by adaptively applying aggressive code transformations on a larger scope. This paper presents a new hardware mechanism for generating and deploying runtime optimized code. The mechanism can be viewed as a filtering system that resides in the retirement stage of the processor pipeline, accepts an instruction execution stream as input, and produces instruction profiles and sets of linked, optimized traces as output. The code deployment mechanism uses an extension to the branch prediction mechanism to migrate execution into the new code without modifying the original code. These new components do not add delay to the execution of the program except during short bursts of reoptimization. This technique provides a strong platform for runtime optimization because the hot execution regions are extracted, optimized, and written to main memory for execution and because these regions persist across context switches. The current design of the framework supports a suite of optimizations, including partial function inlining (even into shared libraries), code straightening optimizations, loop unrolling, and peephole optimizations.