Enlarging Instruction Streams

Authors:
Oliverio J. Santana;Alex Ramirez;Mateo Valero
Venue:
IEEE Transactions on Computers
Year:
2007

Abstract

The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We define a stream as a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The length of instruction streams allows the stream fetch engine to provide high fetch bandwidth and to hide the branch predictor access latency, leading to performance close to that of a trace cache at lower implementation cost and complexity. Therefore, enlarging instruction streams is an effective way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging streams that end at particular branch types. However, our results show that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity of additional hardware mechanisms such as prediction overriding. Moreover, it provides performance comparable to state-of-the-art fetch architectures, but with a simpler design that consumes less energy.
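
To make the stream definition in the abstract concrete, the following is a minimal sketch that splits a dynamic instruction trace into streams, each running from the target of a taken branch up to and including the next taken branch. It is an illustration only, not the authors' implementation: the `Instr` and `Stream` records, the trace format, and the function name `split_into_streams` are all hypothetical, and counting branches is used as a rough proxy for the number of basic blocks in a stream.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Instr:
    """One dynamic instruction in a hypothetical trace format (assumption)."""
    pc: int               # address of the instruction
    is_branch: bool       # True for any control-transfer instruction
    taken: bool = False   # True only if the branch was actually taken


@dataclass(frozen=True)
class Stream:
    """A stream: from a taken-branch target to the next taken branch."""
    start_pc: int   # address of the first instruction in the stream
    length: int     # instruction count, including the terminating taken branch
    branches: int   # branches seen (taken or not); proxy for basic blocks


def split_into_streams(trace: Iterable[Instr]) -> List[Stream]:
    """Cut a dynamic trace at every taken branch.

    Not-taken branches do not end a stream, which is why one stream may
    span several basic blocks. A trailing run of instructions with no
    terminating taken branch is dropped as an incomplete stream.
    """
    streams: List[Stream] = []
    start_pc = None
    length = 0
    branches = 0
    for ins in trace:
        if start_pc is None:
            start_pc = ins.pc  # first instruction after a taken branch
        length += 1
        if ins.is_branch:
            branches += 1
            if ins.taken:
                streams.append(Stream(start_pc, length, branches))
                start_pc, length, branches = None, 0, 0
    return streams


# Hypothetical trace: a not-taken branch at 0x104 does not end the first stream.
example_trace = [
    Instr(0x100, False),
    Instr(0x104, True, taken=False),
    Instr(0x108, False),
    Instr(0x10C, True, taken=True),   # jumps to 0x200, ends stream 1
    Instr(0x200, False),
    Instr(0x204, True, taken=True),   # ends stream 2
    Instr(0x300, False),              # incomplete stream, dropped
]
streams = split_into_streams(example_trace)
stream_lengths = [s.length for s in streams]  # [4, 2]
```

In this sketch the first stream covers four instructions and two branches because the not-taken branch is absorbed into it. This is the property that lets longer streams deliver more instructions per prediction.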