The performance potential of data dependence speculation & collapsing

Authors:
Yiannakis Sazeides;Stamatis Vassiliadis;James E. Smith
Affiliations:
Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engr. Dr., Madison, WI;Department of Electrical Engineering, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands;Department of Electrical and Computer Engineering, University of Wisconsin-Madison, 1415 Engr. Dr., Madison, WI
Venue:
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Year:
1996

Citing 18
Cited 42

Available instruction-level parallelism for superscalar and superpipelined machines

ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
IBM RISC System/6000 processor architecture

IBM Journal of Research and Development
Design of the IBM RISC System/6000 floating-point execution unit

IBM Journal of Research and Development
Limits of instruction-level parallelism

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Single instruction stream parallelism is greater than two

ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
The SPARC architecture manual: version 8

The SPARC architecture manual: version 8
Limits of control flow on parallelism

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Dynamic dependency analysis of ordinary programs

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
Interlock collapsing ALU for increased instruction-level parallelism

MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
SCISM: a scalable compound instruction set machine

IBM Journal of Research and Development
IBM Power and PowerPC

IBM Power and PowerPC
Zero-cycle loads: microarchitecture support for reducing load latency

Proceedings of the 28th annual international symposium on Microarchitecture
Value locality and load value prediction

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Examination of a memory access classification scheme for pointer-intensive and numeric programs

ICS '96 Proceedings of the 10th international conference on Supercomputing
Efficient program tracing

Computer
Interlock Collapsing ALU's

IEEE Transactions on Computers
High-Performance 3-1 Interlock Collapsing ALU's

IEEE Transactions on Computers
Effective Hardware-Based Data Prefetching for High-Performance Processors

IEEE Transactions on Computers

Speculative execution via address prediction and data prefetching

ICS '97 Proceedings of the 11th international conference on Supercomputing
The design and performance of a conflict-avoiding cache

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Trace processors

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Improving the accuracy and performance of memory communication through renaming

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Microarchitecture support for improving the performance of load target prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Streamlining inter-operation memory communication via data dependence prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The predictability of data values

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
The potential of data value speculation to boost ILP

ICS '98 Proceedings of the 12th international conference on Supercomputing
Load execution latency reduction

ICS '98 Proceedings of the 12th international conference on Supercomputing
Speculative multithreaded processors

ICS '98 Proceedings of the 12th international conference on Supercomputing
The effect of instruction fetch bandwidth on value prediction

Proceedings of the 25th annual international symposium on Computer architecture
Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Compiler-directed early load-address generation

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Randomized Cache Placement for Eliminating Conflicts

IEEE Transactions on Computers - Special issue on cache memory and related problems
Correlated load-address predictors

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Cyclic dependence based data reference prediction

ICS '99 Proceedings of the 13th international conference on Supercomputing
Clustered speculative multithreaded processors

ICS '99 Proceedings of the 13th international conference on Supercomputing
Control independence in trace processors

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Value prediction for speculative multithreaded architectures

Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Memory Renaming: Fast, Early and Accurate Processing of Memory Communication

International Journal of Parallel Programming
Limits of Data Value Predictability

International Journal of Parallel Programming
CHIMAERA: a high-performance architecture with a tightly-coupled reconfigurable functional unit

Proceedings of the 27th annual international symposium on Computer architecture
Speculative Memory Cloaking and Bypassing

International Journal of Parallel Programming - Special issue on the 30th annual ACM/IEEE international symposium on microarchitecture, part II
Reducing wire delay penalty through value prediction

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Designing domain-specific processors

Proceedings of the ninth international symposium on Hardware/software codesign
An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors

International Journal of Parallel Programming
Load Redundancy Removal through Instruction Reuse

ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Macro-op Scheduling: Relaxing Scheduling Loop Constraints

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Thread Partitioning and Value Prediction for Exploiting Speculative Thread-Level Parallelism

IEEE Transactions on Computers
Cluster prefetch: tolerating on-chip wire delays in clustered microarchitectures

Proceedings of the 18th annual international conference on Supercomputing
From Sequences of Dependent Instructions to Functions: An Approach for Improving Performance without ILP or Speculation

Proceedings of the 31st annual international symposium on Computer architecture
Dynamic Strands: Collapsing Speculative Dependence Chains for Reducing Pipeline Communication

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Application-Specific Processing on a General-Purpose Core via Transparent Instruction Set Customization

Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
An Architecture Framework for Transparent Instruction Set Customization in Embedded Processors

Proceedings of the 32nd annual international symposium on Computer Architecture
Exploring the design space of LUT-based transparent accelerators

Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems
Speculative parallelization of multipath radiosity algorithm

SPECTS'09 Proceedings of the 12th international conference on Symposium on Performance Evaluation of Computer & Telecommunication Systems
The potential of using dynamic information flow analysis in data value prediction

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Mighty-morphing power-SIMD

CASES '10 Proceedings of the 2010 international conference on Compilers, architectures and synthesis for embedded systems
Leveraging Strength-Based Dynamic Information Flow Analysis to Enhance Data Value Prediction

ACM Transactions on Architecture and Code Optimization (TACO)
Aggressive Value Prediction on a GPU

International Journal of Parallel Programming

Quantified Score

Hi-index	0.01

Visualization

Abstract

Two hardware methods for remedying the effects of true data dependences are studied. The first method dependence speculation, is used to eliminate address generation-load dependences. This is enabled by address prediction that permits load instructions to proceed speculatively without waiting for their address operands. The second technique, dependence collapsing, is used to eliminate data dependences by combining a dependence among multiple instructions into one instruction. The potential of these techniques for improving processor performance is demonstrated via trace-driven simulation. When both techniques are used with maximum issue widths of 4, 8, 16, and 32, the overall speedups in comparison to a base instruction level parallel machine are 1.20, 1.35, 1.51, and 1.66, respectively. In general, dependence collapsing contributes the majority of the improvement in performance. Under the dependence collapsing model, 298 to 478 of the total number of instructions in a trace may be collapsed. The distance separating the collapsed instructions is nearly always less than 8. Our experimentation also suggests that further performance improvements can be achieved by incorporating mechanisms that increase the address prediction rate.