Exploring the performance limits of simultaneous multithreading for memory intensive applications

  • Authors:
  • Evangelia Athanasaki, Nikos Anastopoulos, Kornilios Kourtis, Nectarios Koziris

  • Affiliations:
  • School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens, Zografou, Greece 15773

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 2008

Abstract

Simultaneous multithreading (SMT) has been proposed to improve system throughput by overlapping instructions from multiple threads on a single wide-issue processor. Recent studies have demonstrated that a diversity of simultaneously executed applications can yield significant performance gains under SMT. However, the speedup of a single application that is parallelized into multiple threads is often sensitive to its inherent instruction-level parallelism (ILP), as well as to the efficiency of the synchronization and communication mechanisms between its separate, but possibly dependent, threads. Moreover, because these threads tend to contend for the same architectural resources, no significant speedup may be observed.

In this paper, we evaluate and contrast thread-level parallelism (TLP) and speculative precomputation (SPR) techniques for a series of memory-intensive codes executed on a specific SMT processor implementation. We explore the performance limits by evaluating the tradeoffs between ILP and TLP for various kinds of instruction streams. By studying how such streams interact when executed simultaneously on the processor, and by quantifying their presence within each application's threads, we interpret the observed performance of each application when parallelized according to the aforementioned techniques. To strengthen this evaluation, we also present results gathered from the processor's performance monitoring hardware.