Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Handling long-latency loads in a simultaneous multithreading processor
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Computer
Power4 System Design for High Reliability
IEEE Micro
Boosting SMT Performance by Speculation Control
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Transparent Threads: Resource Sharing in SMT Processors for High Single-Thread Performance
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Front-End Policies for Improved Issue Efficiency in SMT Processors
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Dynamically Controlled Resource Allocation in SMT Processors
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Learning-Based SMT Processor Resource Distribution via Hill-Climbing
Proceedings of the 33rd annual international symposium on Computer Architecture
POWER5 System microarchitecture
IBM Journal of Research and Development - POWER5 and packaging
Predictable Performance in SMT Processors: Synergy between the OS and SMTs
IEEE Transactions on Computers
SPEC CPU2006 benchmark descriptions
ACM SIGARCH Computer Architecture News
FAME: FAirly MEasuring Multithreaded Architectures
PACT '07 Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques
IBM Journal of Research and Development
A dynamic scheduler for balancing HPC applications
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Per-thread cycle accounting in SMT processors
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Probabilistic job symbiosis modeling for SMT processor scheduling
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Load balancing using dynamic cache allocation
Proceedings of the 7th ACM international conference on Computing frontiers
Power and thermal characterization of POWER6 system
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Probabilistic modeling for job symbiosis scheduling on SMT processors
ACM Transactions on Architecture and Code Optimization (TACO)
Enhancing the performance of assisted execution runtime systems through hardware/software techniques
Proceedings of the 26th ACM international conference on Supercomputing
Making data prefetch smarter: adaptive prefetching on POWER7
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Assessing the suitability of the NGMP multi-core processor in the space domain
Proceedings of the tenth ACM international conference on Embedded software
Hi-index | 0.00 |
Due to the limitations of instruction-level parallelism, thread-level parallelism has become a popular way to improve processor performance. One example is the IBM POWER5TM processor, a two-context simultaneous-multithreaded dual-core chip. In each SMT core, the IBM POWER5 features two levels of thread resource balancing and prioritization. The first level provides automatic in-hardware resource balancing, while the second level is a software-controlled priority mechanism that presents eight levels of thread priorities. Currently, software-controlled prioritization is only used in limited number of cases in the software platforms due to lack of performance characterization of the effects of this mechanism. In this work, we characterize the effects of the software-based prioritization on several different workloads. We show that the impact of the prioritization significantly depends on the workloads coscheduled on a core. By prioritizing the right task, it is possible to obtain more than two times of throughput improvement for synthetic workloads compared to the baseline. We also present two application case studies targeting two different performance metrics: the first case study improves overall throughput by 23.7% and the second case study reduces the total execution time by 9.3%. In addition, we show the circumstances when a background thread can be run transparently without affecting the performance of the foreground thread.