Dynamically dispatching speculative threads to improve sequential execution

  • Authors:
  • Yangchun Luo; Antonia Zhai

  • Affiliations:
  • Advanced Micro Devices, Sunnyvale, CA; University of Minnesota, Minneapolis, MN

  • Venue:
  • ACM Transactions on Architecture and Code Optimization (TACO)
  • Year:
  • 2012

Abstract

Efficiently utilizing multicore processors demands extracting thread-level parallelism from applications. Various novel and sophisticated execution models have been proposed to extract thread-level parallelism from sequential programs. One such execution model, Thread-Level Speculation (TLS), allows potentially dependent threads to execute speculatively in parallel. However, TLS execution is inherently unpredictable, and incorrect speculation can degrade the performance of multicore systems. Existing approaches have focused on using the compiler to select the sequential program regions to which TLS is applied. Our research shows that even a state-of-the-art compiler makes suboptimal decisions because of the unpredictability of TLS execution. We therefore propose to optimize TLS performance dynamically. This article describes the design, implementation, and evaluation of a runtime thread dispatching mechanism that adjusts the behavior of speculative threads based on their efficiency. In the proposed system, speculative threads are monitored by hardware-based performance counters, and their performance impact is evaluated with a novel methodology that takes into account various characteristics unique to TLS. Thread dispatching policies are devised to adjust the behavior of speculative threads accordingly. With the help of the runtime evaluation, the system better determines where and how to create speculative threads. Evaluated on all the SPEC CPU2000 benchmark programs written in C, the dynamic dispatching system outperforms state-of-the-art compiler-based thread management techniques by 9.4% on average. Compared to sequential execution, we achieve a 1.37X performance improvement on a four-core CMP-based system.
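
The idea of dispatching speculative threads according to their measured efficiency can be illustrated with a small software sketch. This is not the mechanism described in the article, which relies on hardware performance counters and a TLS-specific evaluation methodology that the abstract does not detail. The class names, the `min_speedup` and `warmup` parameters, and the cycle-ratio efficiency estimate below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Hypothetical per-region totals, standing in for hardware counters."""
    spec_cycles: int = 0      # cycles observed while the region ran under TLS
    seq_cycles_est: int = 0   # estimated cycles for the same work run sequentially
    samples: int = 0          # number of invocations recorded


class Dispatcher:
    """Illustrative policy: speculate on a region only while it appears to pay off."""

    def __init__(self, min_speedup: float = 1.0, warmup: int = 4):
        self.min_speedup = min_speedup  # assumed threshold on estimated speedup
        self.warmup = warmup            # invocations to observe before deciding
        self.stats: dict[str, RegionStats] = {}

    def record(self, region: str, spec_cycles: int, seq_cycles_est: int) -> None:
        """Accumulate one invocation's measurements for a region."""
        s = self.stats.setdefault(region, RegionStats())
        s.spec_cycles += spec_cycles
        s.seq_cycles_est += seq_cycles_est
        s.samples += 1

    def should_speculate(self, region: str) -> bool:
        """Return True if speculative threads should be dispatched for the region."""
        s = self.stats.get(region)
        if s is None or s.samples < self.warmup or s.spec_cycles == 0:
            return True  # not enough evidence yet, keep speculating
        return s.seq_cycles_est / s.spec_cycles >= self.min_speedup
```

In this sketch a region that has not yet been observed for `warmup` invocations is always run speculatively; afterwards it is run speculatively only if the estimated sequential cycle count divided by the observed speculative cycle count meets the threshold. A real system would also need to revisit regions it has disabled, since program behavior changes over time.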