Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication

Authors:
Peter G. Sassone;D. Scott Wills
Affiliations:
Georgia Institute of Technology;Georgia Institute of Technology
Venue:
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Year:
2004

Citing 25
Cited 24

Interlock collapsing ALU for increased instruction-level parallelism

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The performance potential of data dependence speculation & collapsing

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Complexity-effective superscalar processors

Proceedings of the 24th annual international symposium on Computer architecture
The multicluster architecture: reducing cycle time through partitioning

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
MediaBench: a tool for evaluating and synthesizing multimedia and communicatons systems

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Simultaneous subordinate microthreading (SSMT)

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Slice-processors: an implementation of operation-based prediction

ICS '01 Proceedings of the 15th international conference on Supercomputing
rePLay: A Hardware Framework for Dynamic Optimization

IEEE Transactions on Computers
Efficient dynamic scheduling through tag elimination

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
An instruction set and microarchitecture for instruction level distributed processing

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Performance characterization of a hardware mechanism for dynamic optimization

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Reducing the complexity of the register file in dynamic superscalar processors

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Dynamic speculative precomputation

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
High-Performance 3-1 Interlock Collapsing ALU's

IEEE Transactions on Computers
Efficient Interconnects for Clustered Microarchitectures

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Reducing register ports for higher speed and lower energy

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Instruction Pre-Processing in Trace Processors

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Half-price architecture

Proceedings of the 30th annual international symposium on Computer architecture
Improving dynamic cluster assignment for clustered trace cache processors

Proceedings of the 30th annual international symposium on Computer architecture
Speculative Data-Driven Multithreading

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation

Proceedings of the 31st annual international symposium on Computer architecture

Static strands: safely collapsing dependence chains for increasing embedded power efficiency

LCTES '05 Proceedings of the 2005 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Exploring the design space of LUT-based transparent accelerators

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
SPARTAN: speculative avoidance of register allocations to transient values for performance and energy efficiency

Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Serialization-Aware Mini-Graphs: Performance with Fewer Resources

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Matrix scheduler reloaded

Proceedings of the 34th annual international symposium on Computer architecture
Static strands: Safely exposing dependence chains for increasing embedded power efficiency

ACM Transactions on Embedded Computing Systems (TECS) - Special Section LCTES'05
Predicting and Exploiting Transient Values for Reducing Register File Pressure and Energy Consumption

IEEE Transactions on Computers
Dynamic configuration of application-specific implicit instructions for embedded pipelined processors

Proceedings of the 2008 ACM symposium on Applied computing
Achieving Out-of-Order Performance with Almost In-Order Complexity

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
An architecture framework for an adaptive extensible processor

The Journal of Supercomputing
Impact of Software Bypassing on Instruction Level Parallelism and Register File Traffic

SAMOS '08 Proceedings of the 8th international workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation
Selective writeback: reducing register file pressure and energy consumption

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
An Application Development Framework for ARISE Reconfigurable Processors

ACM Transactions on Reconfigurable Technology and Systems (TRETS)
Dynamically utilizing computation accelerators for extensible processors in a software approach

CODES+ISSS '09 Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis
Scalable multi-cores with improved per-core performance using off-the-critical path reconfigurable hardware

HiPC'08 Proceedings of the 15th international conference on High performance computing
A pattern based instruction encoding technique for high performance architectures

International Journal of High Performance Systems Architecture
Mighty-morphing power-SIMD

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Comparing FPGA vs. custom cmos and the impact on processor microarchitecture

Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays
The ARISE approach for extending embedded processors with arbitrary hardware accelerators

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Functional unit chaining: a runtime adaptive architecture for reducing bypass delays

ACSAC'06 Proceedings of the 11th Asia-Pacific conference on Advances in Computer Systems Architecture
Dynamic instruction cascading on GALS microprocessors

PATMOS'05 Proceedings of the 15th international conference on Integrated Circuit and System Design: power and Timing Modeling, Optimization and Simulation
Architecture Optimization of Application-Specific Implicit Instructions

ACM Transactions on Embedded Computing Systems (TECS) - Special Section on CAPA'09, Special Section on WHS'09, and Special Section VCPSS' 09
A just-in-time customizable processor

Proceedings of the International Conference on Computer-Aided Design

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the modern era of wire-dominated architectures, specific effort must be made to reduce needless communication within out-of-order pipelines while still maintaining binary compatibility. To ease pressure on highly-connected elements such as the issue logic and bypass network, we propose the dynamic detection and speculative execution of instruction strands-linear chains of dependent instructions without intermediate fan-out. The hardware required for detecting these chains is simple and resides off the critical path of the pipeline, and the execution targets are the normal ALUs with a self-bypass mode. By collapsing these strings of dependencies into atomic macro-instructions, the efficiency of the issue queue and reorder buffer can be increased. Our results show that over 25% of all dynamic ALU instructions can be grouped, decreasing both the average reorder buffer occupancy and issue queue occupancy by over a third. Additionally, these strands have several properties which make them amenable to simple performance optimizations. Our experiments show average IPC increases of 17% on a four-wide machine and 20% on an eight-wide machine in Spec2000int and Mediabench applications. Finally, strands ease the IPC penalties of multicycle issue and bypass by reducing dependency pressures, providing opportunity for clock frequency gains as well.