Executing multiple threads has proved to be an effective way to partially hide the latencies that appear in a processor. When a thread is stalled because a long-latency operation is being processed, such as a memory access or a floating-point calculation, the processor can switch to another context so that another thread can take advantage of the idle resources. However, fetch stalls caused by branch predictor delay are not hidden by current simultaneous multithreading (SMT) fetch designs, and performance drops because there are no instructions to execute. In this paper, we propose several solutions to reduce the effect of branch predictor delay on the performance of SMT processors. First, we analyse the impact of varying the number of access ports. Second, we describe a decoupled implementation of an SMT fetch unit that helps to tolerate the predictor delay. Finally, we present an interthread pipelined branch predictor, based on creating a pipeline of interleaved predictions from different threads. Our results show that, when all the proposed techniques are combined, the performance obtained is similar to that obtained using an ideal, 1-cycle access branch predictor.
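The intuition behind interleaving predictions from different threads can be illustrated with a toy model. The sketch below is not the paper's simulator or its predictor design; it assumes a fully pipelined predictor that accepts one request per cycle, a fixed latency in cycles, round-robin thread selection, and that a thread cannot request its next prediction until its previous one has finished. The function name `predictions_per_cycle` and all of its parameters are hypothetical.

```python
def predictions_per_cycle(num_threads: int, latency: int, cycles: int = 1000) -> float:
    """Toy throughput model of a pipelined, thread-interleaved predictor.

    Assumptions (illustrative only, not taken from the paper):
      * the predictor accepts at most one new request per cycle;
      * each prediction takes `latency` cycles to complete;
      * a thread may not start a new prediction until its previous one
        has completed, since its next fetch address depends on it;
      * threads are selected in round-robin order.

    Returns the average number of predictions started per cycle.
    """
    ready_at = [0] * num_threads  # earliest cycle at which each thread may issue
    started = 0
    next_thread = 0

    for cycle in range(cycles):
        # Look for the first ready thread, starting from the round-robin pointer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency  # thread waits for its own result
                started += 1
                next_thread = (t + 1) % num_threads
                break  # only one request enters the pipeline per cycle

    return started / cycles


if __name__ == "__main__":
    for threads in (1, 2, 4):
        rate = predictions_per_cycle(threads, latency=4)
        print(f"{threads} thread(s), 4-cycle predictor: {rate:.2f} predictions/cycle")
```

Under these assumptions a single thread only keeps a 4-cycle predictor busy one cycle in four, while four interleaved threads fill every pipeline slot, which matches the rate of a 1-cycle predictor in this simplified model. Real designs face additional constraints, such as port counts and fetch-queue capacity, that the sketch ignores.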