Bulldog: a compiler for VLSI architectures
Bulldog: a compiler for VLSI architectures
Compiling C for vectorization, parallelization, and inline expansion
PLDI '88 Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation
Branch history table prediction of moving target branches due to subroutine returns
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
Effective compiler support for predicated execution using the hyperblock
MICRO 25 Proceedings of the 25th annual international symposium on Microarchitecture
The superblock: an effective technique for VLIW and superscalar compilation
The Journal of Supercomputing - Special issue on instruction-level parallelism
Next cache line and set prediction
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic rescheduling: a technique for object code compatibility in VLIW architectures
Proceedings of the 28th annual international symposium on Microarchitecture
Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation
Path-based next trace prediction
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A Trace Cache Microarchitecture and Evaluation
IEEE Transactions on Computers - Special issue on cache memory and related problems
A scalable front-end architecture for fast instruction delivery
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures
Proceedings of the 27th annual international symposium on Computer architecture
Using cache line coloring to perform aggressive procedure inlining
ACM SIGARCH Computer Architecture News - Special issue on interaction between compilers and computer architectures
The impact of delay on the design of branch predictors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Alto: a link-time optimizer for the Compaq alpha
Software—Practice & Experience
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Design tradeoffs for the Alpha EV8 conditional branch predictor
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Skipper: a microarchitecture for exploiting control-flow independence
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Conversion of control dependence to data dependence
POPL '83 Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages
Filtering Techniques to Improve Trace-Cache Efficiency
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
A Comprehensive Analysis of Indirect Branch Prediction
ISHPC '02 Proceedings of the 4th International Symposium on High Performance Computing
Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
A Study of Control Independence in Superscalar Processors
HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Reconsidering Complex Branch Predictors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Effective ahead pipelining of instruction block address generation
Proceedings of the 30th annual international symposium on Computer architecture
Dynamic Branch Prediction with Perceptrons
HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Power Awareness through Selective Dynamically Optimized Traces
Proceedings of the 31st annual international symposium on Computer architecture
A low-complexity fetch architecture for high-performance superscalar processors
ACM Transactions on Architecture and Code Optimization (TACO)
Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
MinneSPEC: A New SPEC Benchmark Workload for Simulation-Based Computer Architecture Research
IEEE Computer Architecture Letters
Spike: an optimizer for alpha/NT executables
NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997
ISHPC'05/ALPS'06 Proceedings of the 6th international symposium on high-performance computing and 1st international conference on Advanced low power systems
Temporal instruction fetch streaming
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
RDIP: return-address-stack directed instruction prefetching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SHIFT: shared history instruction fetch for lean-core server processors
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
A hyperscalar dual-core architecture for embedded systems
Microprocessors & Microsystems
Hi-index | 14.98 |
The stream fetch engine is a high-performance fetch architecture based on the concept of instruction stream. We call stream to a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks. The long size of instruction streams makes it possible for the stream fetch engine to provide high fetch bandwidth and to hide the branch predictor access latency, leading to performance results close to a trace cache at lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way for improving the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that finalize at particular branch types. However, our results point out that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding. Moreover, it provides high performance results, which are comparable to state-of-the-art fetch architectures, but with a simpler design that consumes less energy.