Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
On pipelining dynamic instruction scheduling logic
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A circuit level implementation of an adaptive issue queue for power-aware microprocessors
GLSVLSI '01 Proceedings of the 11th Great Lakes symposium on VLSI
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Handling long-latency loads in a simultaneous multithreading processor
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Automatically characterizing large scale program behavior
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Front-End Policies for Improved Issue Efficiency in SMT Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
The Impact of Resource Partitioning on SMT Processors
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Dynamically Controlled Resource Allocation in SMT Processors
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Understanding Scheduling Replay Schemes
HPCA '04 Proceedings of the 10th International Symposium on High Performance Computer Architecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Thread-Sensitive Instruction Issue for SMT Processors
IEEE Computer Architecture Letters
An adaptive resource partitioning algorithm for SMT processors
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
A swarm-inspired resource distribution for SMT processors
Proceedings of the 3rd International Conference on Bio-Inspired Models of Network, Information and Computing Sytems
Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Adaptive instruction dispatching techniques for Simultaneous Multi-Threading (SMT) processors
Computers and Electrical Engineering
Computers and Electrical Engineering
Hi-index | 0.00 |
In SMT processors, the complex interplay between private and shared datapath resources needs to be considered in order to realize the full performance potential. In this paper, we show that blindly increasing the size of the per-thread reorder buffers to provide a larger number of in-flight instructions does not result in the expected performance gains but, quite in contrast, degrades the instruction throughput for virtually all multithreaded workloads. The reason for this performance loss is the excessive pressure on the shared datapath resources, especially the instruction scheduling logic. We propose intelligent mechanisms for dynamically adapting the number of reorder buffer entries allocated to each thread in an effort to avoid such allocations if they detrimentally impact the scheduler. We achieve this goal through categorizing the program execution into issue-bound and commit-bound phases and only performing the buffer allocations to the threads operating in commit-bound phases. Our adaptive technique achieves improvements of 21% in instruction throughput and 10% in the fairness metric compared to the best performing baseline configuration with static ROBs.