Fast Barriers for Scalable ccNUMA Systems

  • Authors:
  • Liqun Cheng; John B. Carter

  • Affiliations:
  • University of Utah; University of Utah

  • Venue:
  • ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
  • Year:
  • 2005

Abstract

As multiprocessor systems grow larger and network latency rapidly approaches thousands of processor cycles, the primary factor in determining a barrier algorithm's performance is the number of serial network latencies it requires. Existing barrier algorithms require at least O(log N) round-trip message latencies to perform a single barrier operation on an N-node shared memory multiprocessor. In addition, existing barrier algorithms are not well tuned to the way they interact with modern shared memory systems, which leads to an excessive number of message exchanges to signal barrier completion. The contributions of this paper are threefold. First, we identify and quantify the performance deficiencies of conventional barrier implementations when they are executed on real (non-idealized) hardware. Second, we propose a queue-based barrier algorithm that has effectively O(1) time complexity as measured in round-trip message latencies. Third, we demonstrate how matching the barrier implementation to the way that modern shared memory systems operate can improve performance dramatically by exploiting a hardware write-update (PUT) mechanism for signaling. The resulting barrier algorithm costs only one serialized round-trip message latency to perform a barrier operation across N processors. Using a cycle-accurate execution-driven simulator of a future-generation SGI multiprocessor, we show that with no special hardware support our queue-based barrier outperforms OpenMP's LL/SC-based barrier implementation by a factor of 7.9 on 256 processors. With hardware that supports a coherent PUT operation, our queue-based barrier outperforms OpenMP barriers by a factor of 94 and outperforms barriers based on SGI's memory controller-based atomic operations by a factor of 6.5 on 256 processors.
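
To make the latency argument in the abstract concrete, the following is a rough back-of-envelope sketch, not the paper's model or measurements. It assumes a combining-tree barrier that pays one serialized round trip per tree level, and compares that count against the single serialized round trip the abstract reports for the PUT-based queue barrier. The function name `tree_barrier_round_trips`, the `fanout` parameter, and the constant `QUEUE_BARRIER_ROUND_TRIPS` are hypothetical names introduced only for this illustration.

```python
def tree_barrier_round_trips(n_procs: int, fanout: int = 2) -> int:
    """Number of levels in a fanout-ary tree that covers n_procs processors.

    Assumption for illustration: each tree level costs one serialized
    round-trip message latency, so the level count stands in for the
    O(log N) term mentioned in the abstract.
    """
    if n_procs < 1 or fanout < 2:
        raise ValueError("need n_procs >= 1 and fanout >= 2")
    levels = 0
    covered = 1
    # Integer arithmetic avoids floating-point error in log computations.
    while covered < n_procs:
        covered *= fanout
        levels += 1
    return levels


# Figure stated in the abstract for the queue-based barrier with PUT support.
QUEUE_BARRIER_ROUND_TRIPS = 1

if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, tree_barrier_round_trips(n), QUEUE_BARRIER_ROUND_TRIPS)
```

Under these assumptions the serialized round-trip count of the tree barrier grows with the processor count while the queue-based count stays fixed. The sketch ignores contention, coherence traffic, and per-message costs, which the abstract identifies as significant on real hardware.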