Thread to Core Assignment in SMT On-Chip Multiprocessors

Authors:
Carmelo Acosta;Francisco J. Cazorla;Alex Ramirez;Mateo Valero
Venue:
SBAC-PAD '09 Proceedings of the 2009 21st International Symposium on Computer Architecture and High Performance Computing
Year:
2009

Cited by 4:

Thread to strand binding of parallel network applications in massive multi-threaded systems
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

A workload-aware mapping approach for data-parallel programs
Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers

Optimal task assignment in multithreaded processors: a statistical approach
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems

L1-bandwidth aware thread allocation in multicore SMT processors
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques


Abstract

State-of-the-art high-performance processors such as the IBM POWER5 and Intel i7 reflect an industry trend towards on-chip multiprocessors (CMP) with Simultaneous Multithreading (SMT) in each core. In these processors, the way applications are assigned to cores plays a key role in both per-application and overall system performance. In this paper we show that system throughput depends heavily on the Thread to Core Assignment (TCA), regardless of the SMT Instruction Fetch (IFetch) Policy implemented in the cores. Our results indicate that a good TCA can improve the results of any underlying IFetch Policy, yielding speedups of up to 28%. Given the relevance of TCA, we propose an algorithm to manage it in CMP+SMT processors. The proposed throughput-oriented TCA Algorithm takes into account the workload characteristics and the underlying SMT IFetch Policy. Our results show that the TCA Algorithm obtains thread-to-core assignments within 3% of the optimal assignment in each case, yielding system throughput improvements of up to 21%.
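
To make the notion of an "optimal" thread-to-core assignment concrete, the following is a minimal sketch that enumerates every way of placing threads on 2-way SMT cores and keeps the assignment with the highest estimated throughput. It is not the TCA Algorithm proposed in the paper, which the abstract does not describe. The interference model in `core_throughput`, the `solo_ipc` and `mem_intensity` inputs, and all function names are hypothetical and chosen only for illustration.

```python
def pairings(threads):
    """Yield every way to split an even-sized list of threads into unordered pairs.

    Each pair stands for the two threads sharing one 2-way SMT core.
    """
    if not threads:
        yield []
        return
    first, rest = threads[0], threads[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def core_throughput(pair, solo_ipc, mem_intensity):
    """Estimate the combined IPC of two threads sharing a core.

    Hypothetical model (an assumption, not taken from the paper): co-running
    threads lose throughput in proportion to the product of their memory
    intensities, each given as a value in [0, 1].
    """
    a, b = pair
    slowdown = 1.0 - 0.5 * mem_intensity[a] * mem_intensity[b]
    return (solo_ipc[a] + solo_ipc[b]) * slowdown


def system_throughput(assignment, solo_ipc, mem_intensity):
    """Sum the estimated throughput of all cores in one assignment."""
    return sum(core_throughput(p, solo_ipc, mem_intensity) for p in assignment)


def best_assignment(threads, solo_ipc, mem_intensity):
    """Exhaustively search all assignments and return the best one under the model."""
    return max(
        pairings(threads),
        key=lambda tca: system_throughput(tca, solo_ipc, mem_intensity),
    )


if __name__ == "__main__":
    # Illustrative workload: A and B are memory-bound, C and D are compute-bound.
    threads = ["A", "B", "C", "D"]
    solo_ipc = {"A": 1.0, "B": 1.0, "C": 2.0, "D": 2.0}
    mem_intensity = {"A": 0.9, "B": 0.9, "C": 0.1, "D": 0.1}

    for tca in pairings(threads):
        print(tca, round(system_throughput(tca, solo_ipc, mem_intensity), 3))
    print("best:", best_assignment(threads, solo_ipc, mem_intensity))
```

Under this toy model, the search avoids placing the two memory-bound threads on the same core. Exhaustive enumeration grows quickly with the number of threads, which is one reason a heuristic that approximates the optimum is of practical interest.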