Performance analysis of thread mappings with a holistic view of the hardware resources

  • Authors:
Wei Wang; Tanima Dey; Jason Mars; Lingjia Tang; Jack W. Davidson; Mary Lou Soffa

  • Affiliations:
All authors: Department of Computer Science, University of Virginia, Charlottesville, VA 22904, USA

  • Venue:
  • ISPASS '12 Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software
  • Year:
  • 2012

Abstract

With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome these difficulties, the interaction between the various microarchitectural resources and thread characteristics must be well understood. This paper presents an in-depth analysis of the PARSEC benchmarks running under different thread mappings to investigate how thread mappings interact with microarchitectural resources, including L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units, and the cores themselves. For each resource, the analysis provides guidelines for improving its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies with the workload. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm improves performance by as much as 28%. Additionally, we demonstrate that thread mapping should treat L2 caches, prefetchers, and off-chip memory interconnects as one resource, and we present a new metric called the L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
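As a rough illustration of the idea behind the L2MP metric (a sketch inferred from the metric's name, not the paper's exact definition), one could combine per-mapping hardware counter readings into a single product. The counter values, function name, and scoring rule below are illustrative assumptions:

```python
# Hypothetical sketch: scoring a thread mapping by the product of its
# L2 miss count and average memory latency, reflecting the abstract's
# point that L2 caches, prefetchers, and the off-chip interconnect
# should be evaluated as one aggregate resource.
# The formula and counter values are illustrative, not the paper's.

def l2mp(l2_misses: int, avg_memory_latency_cycles: float) -> float:
    """L2-misses-memory-latency-product: a higher value suggests the
    mapping puts more aggregate pressure on the L2/memory subsystem."""
    return l2_misses * avg_memory_latency_cycles

# Compare two hypothetical thread mappings of the same workload;
# prefer the one with the lower L2MP score.
score_a = l2mp(l2_misses=4_200_000, avg_memory_latency_cycles=180.0)
score_b = l2mp(l2_misses=5_100_000, avg_memory_latency_cycles=95.0)
preferred = "A" if score_a < score_b else "B"
```

Under this sketch, mapping B is preferred even though it incurs more L2 misses, because its lower average memory latency (e.g., from effective prefetching or a less congested interconnect) yields a smaller aggregate product.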