Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs

  • Authors:
  • M. Aater Suleman;Moinuddin K. Qureshi;Yale N. Patt

  • Affiliations:
  • The University of Texas at Austin, Austin, TX;T. J. Watson Research Center, Yorktown Hieghts, NY;The University of Texas at Austin, Austin, TX

  • Venue:
  • Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Extracting high-performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas, the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications that get limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications that get limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an application gets limited by data-synchronization, or bandwidth, or neither depends not only on the application but also on the input set and the machine configuration. Therefore, controlling the number of threads based on the run-time behavior of the application can significantly improve performance and reduce power. This paper proposes Feedback-Driven Threading (FDT), a framework to dynamically control the number of threads using run-time information. FDT can be used to implement Synchronization-Aware Threading (SAT), which predicts the optimal number of threads depending on the amount of data-synchronization. Our evaluation shows that SAT can reduce both execution time and power by up to 66% and 78% respectively. Similarly, FDT can be used to implement Bandwidth-Aware Threading (BAT), which predicts the minimum number of threads required to saturate the off-chip bus. Our evaluation shows that BAT reduces on-chip power by up to 78%. When SAT and BAT are combined, the average execution time reduces by 17% and power reduces by 59%. The proposed techniques leverage existing performance counters and require minimal support from the threading library.