A processor architecture for horizon
Proceedings of the 1988 ACM/IEEE conference on Supercomputing
DISC: dynamic instruction stream computer
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Increasing superscalar performance through multistreaming
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
ICS '90 Proceedings of the 4th international conference on Supercomputing
APRIL: a processor architecture for multiprocessing
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
SMT Layout Overhead and Scalability
IEEE Transactions on Parallel and Distributed Systems
Dynamic Scheduling Issues in SMT Architectures
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
The Impact of Resource Partitioning on SMT Processors
Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques
Design of a Computer—The Control Data 6600
Design of a Computer—The Control Data 6600
The impact of speculative execution on SMT processors
International Journal of Parallel Programming
Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Hi-index | 0.00 |
Simultaneous MultiThreading (SMT) achieves better system resource utilization and higher performance because it exploits Thread-Level Parallelism (TLP) in addition to “conventional” Instruction-Level Parallelism (ILP). Theoretically, system resources in every pipeline stage of an SMT microarchitecture can be dynamically shared. However, in commercial applications, all the major queues are statically partitioned. From an implementation point of view, static partitioning of resources is easier to implement and has a lower hardware overhead and power consumption. In this paper, we strive to quantitatively determine the tradeoff between static partitioning and dynamic sharing. We find that static partitioning of either the instruction fetch queue (IFQ) or the reorder buffer (ROB) is not sufficient if implemented alone (3% and 9% performance decrease respectively in the worst case comparing with dynamic sharing), while statically partitioning both the IFQ and the ROB could achieve an average performance gain of 9% at least, and even reach 148% when running with floating-point benchmarks, when compared with dynamic sharing. We varied the number of functional units in our efforts to isolate the reason for this performance improvement. We found that static partitioning both queues outperformed all the other partitioning mechanisms under the same system configuration. This demonstrates that the performance gain has been achieved by moving from dynamic sharing to static partitioning of the system resources.