Detecting and exploiting causal relationships in hardware shared-memory multiprocessors

  • Authors:
  • Harold W. Cain, III;Mikko H. Lipasti

  • Affiliations:
  • The University of Wisconsin - Madison;The University of Wisconsin - Madison

  • Venue:
  • Detecting and exploiting causal relationships in hardware shared-memory multiprocessors
  • Year:
  • 2004

Abstract

This thesis focuses on mechanisms that improve inter-processor communication in hardware shared-memory multiprocessors by detecting and exploiting knowledge of the causal relationships among inter-processor reads and writes to shared memory. We present two applications for exploiting causal dependence knowledge: the avoidance of replays in a novel value-based memory ordering mechanism, and the avoidance of coherence misses in an invalidation-based coherence protocol. Conventional out-of-order processors employ a multiported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult. The value-based memory ordering mechanism solves the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. By inferring the existence of causal relationships among processors, the set of loads that must be replayed is filtered, decreasing the cache bandwidth demands of the load replay mechanism. Consequently, the replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design requiring no associative lookup hardware, while sacrificing only a negligible amount of performance and cache bandwidth.
The overhead of inter-processor communication in shared-memory multiprocessors is a dominant source of processor stalls for many applications. We present a new edge-chasing algorithm for detecting causal relationships in shared-memory multiprocessors, and present an implementation of delayed consistency based on this algorithm that can avoid coherence misses by allowing a processor to continue reading an invalidated cache block until the processor becomes causally dependent upon a newer version of the block. We show that edge-chasing delayed consistency can dramatically improve performance for lock-free list manipulation algorithms that operate on highly-contended data structures, and can also improve the performance of some commercial workloads, by up to 8% for the applications presented in this thesis.
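The value-based replay idea described in the abstract can be illustrated with a small software model. This is only a minimal sketch under stated assumptions, not the hardware design from the thesis: the names `LoadEntry`, `retire_loads`, and `must_replay` are hypothetical, memory is modeled as a plain dictionary, and the replay filter is a placeholder predicate standing in for whatever causal-dependence information the real mechanism uses.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LoadEntry:
    """Hypothetical FIFO load queue entry (illustrative, not from the thesis)."""
    address: int
    premature_value: int  # value observed when the load first executed out of order


def retire_loads(load_queue, memory, must_replay=lambda entry: True):
    """Drain a FIFO load queue in program order.

    Each load selected by must_replay is re-executed against memory and its
    new value is compared with the value it obtained prematurely. A mismatch
    means the early execution violated ordering, so retirement stops there.
    Returns (number_of_loads_retired, offending_entry_or_None).
    """
    retired = 0
    while load_queue:
        entry = load_queue[0]  # oldest load first: no associative search needed
        if must_replay(entry) and memory[entry.address] != entry.premature_value:
            return retired, entry  # caller would squash and re-execute from here
        load_queue.popleft()
        retired += 1
    return retired, None


# Usage: a remote write changes address 0x20 after the load executed early.
memory = {0x10: 1, 0x20: 2}
queue = deque([LoadEntry(0x10, 1), LoadEntry(0x20, 2)])
memory[0x20] = 5
retired, violated = retire_loads(queue, memory)
```

In this model the queue is only ever accessed at its head, which mirrors the abstract's point that a FIFO structure suffices once ordering is checked by comparing values at retirement. Passing a stricter `must_replay` predicate reduces the number of re-executions, which corresponds to the bandwidth saving the abstract attributes to replay filtering.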