Circulating shared-registers for multiprocessor systems

  • Authors:
  • Donald Johnson; David J. Lilja; John Riedl

  • Affiliations:
  • CS Department, Azusa Pacific University, Azusa, CA; Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN; Department of Computer Science, University of Minnesota, Minneapolis, MN

  • Venue:
  • Journal of Systems Architecture: the EUROMICRO Journal
  • Year:
  • 2006

Abstract

Techniques for fine-grained data sharing are generally available only on specialized architectures, usually ones built around a shared bus. The CIRculating Common-Update Sharing (CIRCUS) mechanism provides low-latency, user-level, contention-free access to a set of shared circulating data registers. The local access latency is near zero for both read and write operations. These operations can be mapped onto more complex operations, such as arithmetic, logical, or data-reduction operations (for example, minimum or sum), which the circulating register hardware (CRH) performs on the circulating copy of a register. The CRH can also be used to perform atomic operations, such as fetch&add or swap. For a two-dimensional hierarchy of N processing elements (PEs), the write-latency (the time until the circulating register is updated with a new value) and the update-latency (the time until all CRH modules can see the updated value) have an optimum cluster size proportional to $(N \cdot I / D)^{1/2}$, where I is the intercluster time and D is the inter-PE time, including the time between and through one node. The latencies, when optimally clustered, are proportional to $(N \cdot I \cdot D)^{1/2}$. Sub-microsecond write-latency is expected for up to 15,255 PEs or 660 workstations. For higher levels of hierarchy, the expected write-latency is shown to be proportional to the sum of the latencies of all loop hierarchies. CIRCUS is applicable to a wide variety of system architectures and topologies.
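
The square-root scaling quoted in the abstract can be illustrated with a small sketch. The cost model below is an assumption made for illustration only, not the paper's derivation: it supposes a value travels once around a cluster loop of c PEs (costing c · D) and once around a loop of N/c clusters (costing (N/c) · I). Minimizing c · D + (N/c) · I over c gives c = (N · I/D)^(1/2) and a latency of 2 · (N · I · D)^(1/2), which agrees with the stated proportionalities. All function names and the numeric values of N, I, and D are hypothetical.

```python
import math


def loop_latency(n_pes, cluster_size, inter_pe, inter_cluster):
    """Assumed cost model (not taken from the paper): one trip around a
    cluster loop plus one trip around the loop that connects the clusters."""
    return cluster_size * inter_pe + (n_pes / cluster_size) * inter_cluster


def optimal_cluster_size(n_pes, inter_pe, inter_cluster):
    """Closed-form minimizer of loop_latency: sqrt(N * I / D)."""
    return math.sqrt(n_pes * inter_cluster / inter_pe)


def optimal_latency(n_pes, inter_pe, inter_cluster):
    """Latency of the assumed model at the optimum: 2 * sqrt(N * I * D)."""
    return 2.0 * math.sqrt(n_pes * inter_cluster * inter_pe)


def brute_force_cluster_size(n_pes, inter_pe, inter_cluster):
    """Check the closed form by scanning every integer cluster size."""
    return min(
        range(1, n_pes + 1),
        key=lambda c: loop_latency(n_pes, c, inter_pe, inter_cluster),
    )


if __name__ == "__main__":
    # Hypothetical values in arbitrary time units; none come from the paper.
    N, D, I = 1024, 1.0, 4.0
    print("closed-form cluster size:", optimal_cluster_size(N, D, I))
    print("brute-force cluster size:", brute_force_cluster_size(N, D, I))
    print("latency at optimum:", optimal_latency(N, D, I))
```

Under this assumed model, quadrupling N doubles both the best cluster size and the resulting latency, which is the practical meaning of the square-root dependence.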