A latency-conscious SMT branch prediction architecture

Authors:
Ayose Falcon;Oliverio J. Santana;Alex Ramirez;Mateo Valero
Affiliations:
Computer Architecture Department, UPC, Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain.;Computer Architecture Department, UPC, Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain.;Computer Architecture Department, UPC, Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain.;Computer Architecture Department, UPC, Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain
Venue:
International Journal of High Performance Computing and Networking
Year:
2004

Citing 18
Cited 0

Next cache line and set prediction

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simultaneous multithreading: maximizing on-chip parallelism

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Increasing superscalar performance through multistreaming

PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor

ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching

Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
Trading conflict and capacity aliasing in conditional branch predictors

Proceedings of the 24th annual international symposium on Computer architecture
Path-based next trace prediction

MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
A scalable front-end architecture for fast instruction delivery

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Clock rate versus IPC: the end of the road for conventional microarchitectures

Proceedings of the 27th annual international symposium on Computer architecture
The impact of delay on the design of branch predictors

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Optimizations Enabled by a Decoupled Front-End Architecture

IEEE Transactions on Computers
The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Design tradeoffs for the Alpha EV8 conditional branch predictor

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Fetching instruction streams

Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture
Reconsidering Complex Branch Predictors

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Effective ahead pipelining of instruction block address generation

Proceedings of the 30th annual international symposium on Computer architecture
Spike: an optimizer for alpha/NT executables

NT'97 Proceedings of the USENIX Windows NT Workshop on The USENIX Windows NT Workshop 1997

Quantified Score

Hi-index	0.00

Visualization

Abstract

Executing multiple threads has proved to be an effective solution to partially hide latencies that appear in a processor. When a thread is stalled because of a long-latency operation is being processed, such as a memory access or a floating-point calculation, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stall conditions caused by a branch predictor delay are not hidden by current simultaneous multithreading (SMT) fetch designs, causing a performance drop due to the absence of instructions to execute. In this paper, we propose several solutions to reduce the effect of branch predictor delay in the performance of SMT processors. Firstly, we analyse the impact of varying the number of access ports. Secondly, we describe a decoupled implementation of an SMT fetch unit that helps to tolerate the predictor delay. Finally, we present an interthread pipelined branch predictor, based on creating a pipeline of interleaved predictions from different threads. Our results show that, combining all the proposed techniques, the performance obtained is similar to that obtained using an ideal, 1-cycle access branch predictor.