Dynamic Optimization of Micro-Operations

Authors:
Brian Slechta;David Crowe;Brian Fahs;Michael Fertig;Gregory Muthler;Justin Quek;Francesco Spadini;Sanjay J. Patel;Steven S. Lumetta
Affiliations:
-;-;-;-;-;-;-;-;-
Venue:
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Year:
2003

Citing 14
Cited 9

Global value numbers and redundant computations

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Facilitating superscalar processing via a combined static/dynamic register renaming scheme

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Enhancing instruction scheduling with a block-structured ISA

International Journal of Parallel Programming
Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences

Proceedings of the 24th annual international symposium on Computer architecture
DAISY: dynamic compilation for 100% architectural compatibility

Proceedings of the 24th annual international symposium on Computer architecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Increasing the instruction fetch rate via block-structured instruction set architectures

International Journal of Parallel Programming - Special issue: MICRO-29, 29th annual IEEE/ACM international symposium on microarchitecture
A hardware-driven profiling scheme for identifying program hot spots to support runtime optimization

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Instruction path coprocessors

Proceedings of the 27th annual international symposium on Computer architecture
Increasing the size of atomic instruction blocks using control flow assertions

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Performance characterization of a hardware mechanism for dynamic optimization

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Dynamic and Transparent Binary Translation

Computer
Instruction Pre-Processing in Trace Processors

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Trace Scheduling: A Technique for Global Microcode Compaction

IEEE Transactions on Computers

Improving quasi-dynamic schedules through region slip

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Selecting long atomic traces for high coverage

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture

Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Power Awareness through Selective Dynamically Optimized Traces

Proceedings of the 31st annual international symposium on Computer architecture
Constructing Virtual Architectures on a Tiled Processor

Proceedings of the International Symposium on Code Generation and Optimization
Reducing energy of virtual cache synonym lookup using bloom filters

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
TAO: two-level atomicity for dynamic binary optimizations

Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization
PARROT: power awareness through selective dynamically optimized traces

PACS'03 Proceedings of the Third international conference on Power - Aware Computer Systems
LAR-CC: Large atomic regions with conditional commits

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

Inherent within complex instruction set architectures such as x86 are inefficiencies that do not exist in a simpler ISAs. Modern x86 implementations decode instructions into one or more micro-operations in order to deal with the complexity of the ISA. Since these micro-operations are not visible to the compiler, the stream of micro-operations can contain redundancies even in statically optimized x86 code. Within a processor implementation, however, barriers at the ISA level do not apply, and these redundancies can be removed by optimizing the micro-operation stream.In this paper, we explore the opportunities to optimize code at the micro-operation granularity. We execute these micro-operation optimizations using the rePLay Framework as a microarchitectural substrate. Using a simple set of seven optimizations, including two that aggressively and speculatively attempt to remove redundant load instructions, we examine the effects of dynamic optimization of micro-operations using a trace-driven simulation environment.Simulation reveals that across a sampling of SPECint 2000 and real x86 applications, rePLay is able to reduce micro-operation count by 21% and, in particular, load micro-operation count by 22%. These reductions correspond to a boost in observed instruction-level parallelism on an 8-wide optimizing rePLay processor by 17% over a non-optimizing configuration.