Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Low-cost and energy-efficient distributed synchronization for embedded multiprocessors
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Hi-index | 0.00 |
As multiprocessors systems become larger and larger and network latency rapidly approaches thousands of processor cycles, the primary factor in determining a barrier algorithmýs performance is the number of serial network latencies it requires. Existing barrier algorithms require at least O(logN) round trip message latencies to perform a single barrier operation on an N-node shared memory multiprocessor. In addition, existing barrier algorithms are not well tuned in terms of how they interact with modern shared memory systems, which leads to an excessive number of message exchanges to signal barrier completion. The contributions of this paper are threefold. First, we identify and quantify the performance deficiencies of conventional barrier implementations when they are executed on real (non-idealized) hardware. Second, we propose a queue-based barrier algorithm that has effectively O(1) time complexity as measured in round trip message latencies. Third, we demonstrate how matching the barrier implementation to the way that modern shared memory systems operate can improve performance dramatically by exploiting a hardware write-update (PUT) mechanism for signaling. The resulting barrier algorithm only costs one serialized round trip message latency to perform a barrier operation across N processors. Using a cycle-accurate execution-driven simulator of a future-generation SGI multiprocessor, we show that with no special hardware support our queue-based barrier outperforms OpenMPýs LL/SC-based barrier implementation by a factor of 7.9 on 256 processors. With hardware that supports a coherent PUT operation, our queue-based barrier outperforms OpenMP barriers by a factor of 94 and outperforms barriers based on SGIýs memory controller-based atomic operations by a factor of 6.5 on 256 processors.