Optimizing thread throughput for multithreaded workloads on memory constrained CMPs

  • Authors:
  • Major Bhadauria; Sally A. McKee

  • Affiliations:
  • Cornell University, Ithaca, NY, USA; Cornell University, Ithaca, NY, USA

  • Venue:
  • Proceedings of the 5th conference on Computing frontiers
  • Year:
  • 2008

Abstract

Multi-core designs have become the industry imperative, replacing reliance on increasingly complicated micro-architectural designs and VLSI improvements to deliver higher performance at lower power budgets. The performance of these multi-core chips will be limited by the DRAM memory system; we demonstrate this by modeling a cycle-accurate DDR2 memory controller with SPLASH-2 workloads. Surprisingly, benchmarks that appear to scale well with the number of processors fail to do so when memory is accurately modeled, and we frequently find that the most efficient configuration is not the one with the most threads. Choosing the most efficient number of threads for each benchmark improves energy-delay efficiency by a factor of 3.39 and performance by 19.7%, on average. We also introduce a shadow row of sense amplifiers, an alternative to cached DRAM, to explore potential power/performance impacts. The shadow row works in conjunction with the L2 cache to leverage temporal and spatial locality across memory accesses, attaining average and peak speedups of 13% and 43%, respectively, over a state-of-the-art DRAM memory scheduler.
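
The idea of picking the most efficient thread count per benchmark can be illustrated with a minimal sketch. It assumes the efficiency metric is the energy-delay product (energy times runtime); the abstract does not spell out the exact definition. The `Run` type, the `most_efficient` helper, and every number below are hypothetical and chosen only to show how the fastest configuration and the most efficient one can differ. They are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One hypothetical measurement of a benchmark at a given thread count."""
    threads: int
    seconds: float  # wall-clock runtime
    joules: float   # total energy consumed

    @property
    def edp(self) -> float:
        # Energy-delay product: lower is better (assumed metric).
        return self.joules * self.seconds


def most_efficient(runs):
    """Return the run with the lowest energy-delay product."""
    return min(runs, key=lambda r: r.edp)


# Illustrative, made-up numbers: runtime stops improving much beyond
# 4 threads (as if memory-bound), while energy keeps growing.
runs = [
    Run(threads=1, seconds=8.0, joules=40.0),
    Run(threads=2, seconds=4.5, joules=36.0),
    Run(threads=4, seconds=3.0, joules=39.0),
    Run(threads=8, seconds=2.8, joules=56.0),
]

best = most_efficient(runs)
fastest = min(runs, key=lambda r: r.seconds)

print(f"most efficient: {best.threads} threads (EDP {best.edp:.1f})")
print(f"fastest:        {fastest.threads} threads (EDP {fastest.edp:.1f})")
```

With these made-up inputs, the 8-thread run is marginally faster, but the 4-thread run has the lower energy-delay product, which mirrors the abstract's observation that the most efficient configuration is not necessarily the one with the most threads.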