Parallelism in the front-end

Authors:
Paramjit S. Oberoi;Gurindar S. Sohi
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison
Venue:
Proceedings of the 30th annual international symposium on Computer architecture
Year:
2003

Citing 15
Cited 7

Branch prediction for free

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
A fill-unit approach to multiple instruction issue

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
Reducing branch costs via branch alignment

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Optimization of instruction fetch mechanisms for high issue rates

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Reducing the performance impact of instruction cache misses by writing instructions into the reservation stations out-of-order

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Putting the fill unit to work: dynamic optimizations for trace cache microprocessors

MICRO 31 Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture
Software trace cache

ICS '99 Proceedings of the 13th international conference on Supercomputing
Increasing the size of atomic instruction blocks using control flow assertions

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Skipper: a microarchitecture for exploiting control-flow independence

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Out-of-Order Instruction Fetch Using Multiple Sequencers

ICPP '02 Proceedings of the 2002 International Conference on Parallel Processing
Design and evaluation of a multiscalar processor

Design and evaluation of a multiscalar processor

LLVA: A Low-level Virtual Instruction Set Architecture

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
WaveScalar

Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Reducing Rename Logic Complexity for High-Speed and Low-Power Front-End Architectures

IEEE Transactions on Computers
Block-aware instruction set architecture

ACM Transactions on Architecture and Code Optimization (TACO)
Wide and efficient trace prediction using the local trace predictor

Proceedings of the 20th annual international conference on Supercomputing
Evaluating trace cache energy efficiency

ACM Transactions on Architecture and Code Optimization (TACO)
Ginger: control independence using tag rewriting

Proceedings of the 34th annual international symposium on Computer architecture

Quantified Score

Hi-index	0.00

Visualization

Abstract

As processor back-ends get more aggressive, front-ends will have to scale as well. Although the back-ends of superscalar processors have continued to become more parallel, the front-ends remain sequential. This paper describes techniques for fetching and renaming multiple non-contiguous portions of the dynamic instruction stream in parallel using multiple fetch and rename units. It demonstrates that parallel front-ends are a viable alternative to high-performance sequential front-ends.Compared with an equivalently-sized trace cache, our technique increases cache bandwidth utilization by 17%, front-end throughput by 20%, and performance by 5%. Parallelism also enhances latency tolerance: a parallel front-end loses only 6% performance as the cache size is decreased from 128 KB to 8 KB, compared with a 50--65% performance loss for sequential fetch mechanisms.