Using Dynamic Binary Translation to Fuse Dependent Instructions

Authors:
Shiliang Hu;James E. Smith
Venue:
Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
Year:
2004

Abstract

Instruction scheduling hardware can be simplified and easily pipelined if pairs of dependent instructions are fused so they share a single instruction scheduling slot. We study an implementation of the x86 ISA that dynamically translates x86 code to an underlying ISA that supports instruction fusing. A microarchitecture that is co-designed with the fused instruction set completes the implementation.

In this paper, we focus on the dynamic binary translator for such a co-designed x86 virtual machine. The dynamic binary translator first cracks x86 instructions belonging to hot superblocks into RISC-style micro-operations, and then uses heuristics to fuse together pairs of dependent micro-operations.

Experimental results with SPEC2000 integer benchmarks demonstrate that: (1) the fused ISA with dynamic binary translation reduces the number of scheduling decisions by about 30% versus a conventional implementation that uses hardware cracking into RISC micro-operations; (2) an instruction scheduling slot needs only hold two source register fields even though it may hold two instructions; (3) translations generated in the proposed ISA consume about 30% less storage than a corresponding fixed-length RISC-style ISA.
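The fusing step described in the abstract can be illustrated with a small sketch. This is not the paper's heuristic, which the abstract does not spell out; it is a minimal greedy pass under stated assumptions: micro-ops are register-to-register only (memory ordering and condition codes are ignored), each head is paired with its nearest legal consumer, and a fused pair is allowed at most two external source registers, echoing result (2). The names `MicroOp`, `fuse_pairs`, and `max_sources` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    """Hypothetical register-to-register micro-op (not the paper's encoding)."""
    name: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def fuse_pairs(ops: List[MicroOp], max_sources: int = 2) -> List[Tuple[MicroOp, ...]]:
    """Greedily pair each head with its nearest dependent consumer (assumed heuristic).

    Returns scheduling entries: 1-tuples (unfused) or (head, tail) pairs.
    """
    used = set()      # indices already placed in some entry
    entries = []
    for i, head in enumerate(ops):
        if i in used:
            continue
        used.add(i)
        tail_index = None
        if head.dest is not None:
            written, read = set(), set()  # registers touched between head and candidate
            for j in range(i + 1, len(ops)):
                cand = ops[j]
                if j not in used and head.dest in cand.srcs:
                    # Sources the pair needs from outside; head.dest is forwarded internally.
                    external = set(head.srcs) | (set(cand.srcs) - {head.dest})
                    # The tail is hoisted next to the head, so it must not cross
                    # a RAW, WAR, or WAW dependence (checked conservatively).
                    hoistable = (
                        not (set(cand.srcs) & written)
                        and cand.dest not in read
                        and cand.dest not in written
                    )
                    if hoistable and len(external) <= max_sources:
                        tail_index = j
                        break
                if cand.dest is not None:
                    written.add(cand.dest)
                read.update(cand.srcs)
                if head.dest in written:
                    break  # head's value is overwritten; no later consumer exists
        if tail_index is None:
            entries.append((head,))
        else:
            used.add(tail_index)
            entries.append((head, ops[tail_index]))
    return entries


ops = [
    MicroOp("add", "r1", ("r2", "r3")),
    MicroOp("sub", "r4", ("r5", "r6")),
    MicroOp("shl", "r7", ("r1",)),
    MicroOp("and", "r8", ("r4", "r7")),
]
entries = fuse_pairs(ops)
print([tuple(op.name for op in e) for e in entries])
# prints [('add', 'shl'), ('sub',), ('and',)]
```

In this toy input, four micro-ops occupy three scheduling entries: `shl` is fused behind `add`, while `sub` and `and` stay unfused because pairing them would need three external source registers. A real translator would also have to handle memory operations, condition codes, and superblock exits, all of which this sketch omits.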