Performance analysis of thread mappings with a holistic view of the hardware resources

  • Authors:
Wei Wang; Tanima Dey; Jason Mars; Lingjia Tang; Jack W. Davidson; Mary Lou Soffa

  • Affiliations:
All authors: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA

  • Venue:
  • ISPASS '12 Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software
  • Year:
  • 2012

Abstract

With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome these difficulties, the interaction between the various microarchitectural resources and thread characteristics must be well understood. This paper presents an in-depth analysis of the PARSEC benchmarks running under different thread mappings to investigate how thread mappings interact with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units, and the cores themselves. For each resource, the analysis provides guidelines for improving its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies with the workload. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm improves performance by as much as 28%. Additionally, we demonstrate that thread mapping should treat L2 caches, prefetchers, and off-chip memory interconnects as one resource, and we present a new metric called the L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
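As a rough illustration of the idea behind the L2MP metric (a sketch inferred from the metric's name, not the paper's exact definition), one could combine per-mapping hardware counter readings into a single product. The counter values, function name, and scoring rule below are illustrative assumptions:

```python
# Hypothetical sketch: scoring a thread mapping by the product of its
# L2 miss count and average memory latency, reflecting the abstract's
# point that L2 caches, prefetchers, and the off-chip interconnect
# should be evaluated as one aggregate resource.
# The formula and counter values are illustrative, not the paper's.

def l2mp(l2_misses: int, avg_memory_latency_cycles: float) -> float:
    """L2-misses-memory-latency-product: a higher value suggests the
    mapping puts more aggregate pressure on the L2/memory subsystem."""
    return l2_misses * avg_memory_latency_cycles

# Compare two hypothetical thread mappings of the same workload;
# prefer the one with the lower L2MP score.
score_a = l2mp(l2_misses=4_200_000, avg_memory_latency_cycles=180.0)
score_b = l2mp(l2_misses=5_100_000, avg_memory_latency_cycles=95.0)
preferred = "A" if score_a < score_b else "B"
```

Under this sketch, mapping B is preferred even though it incurs more L2 misses, because its lower average memory latency (e.g., from effective prefetching or a less congested interconnect) yields a smaller aggregate product.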