Scalable barrier synchronisation for large-scale shared-memory multiprocessors

Authors:
Zhen Fang;Lixin Zhang;John B. Carter;Mike Parker
Affiliations:
School of Computing, University of Utah, Salt Lake City, UT 84112, USA;IBM Austin Research Lab, 11400 Burnet Rd, MS 904/6C019, Austin, TX 78758, USA;School of Computing, University of Utah, Salt Lake City, UT 84112, USA;Cray, Inc., 1050 Lowater Road, Chippewa Falls, WI 54729, USA
Venue:
International Journal of High Performance Computing and Networking
Year:
2004

Abstract

Barrier synchronisation is a critical operation in scalable multiprocessors. As network latency rapidly approaches thousands of processor cycles and multiprocessor systems grow ever larger, conventional barrier techniques are failing to keep up with the increasing demand for efficient synchronisation. In this paper, we present a memory controller-based operation that optimises the barrier function of an OpenMP library. The proposed mechanism allows atomic operations on the barrier variable to be executed on the home memory controller, and allows the home memory controller to send fine-grained updates to waiting processors when a barrier variable reaches certain values. On a cycle-accurate execution-driven simulator, experimental results show that the proposed barrier implementation outperforms a conventional LL/SC (Load-Linked/Store-Conditional) version by 20.8X, a conventional processor-side atomic instruction version by 15.5X, and an active messages version by 13.4X. To the best of our knowledge, the proposed barrier achieves better performance than all other existing non-hardwired implementations, and does so with an improved programming interface.
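
The mechanism described in the abstract can be illustrated with a small, single-threaded toy model in C. This is only a sketch under stated assumptions, not the paper's hardware design or its programming interface: the type `mc_barrier`, the functions `mc_barrier_init`, `mc_barrier_arrive` and `mc_barrier_is_released`, and the constant `MAX_PROCS` are all hypothetical names introduced here. The model assumes that each arriving processor sends one request to the home memory controller, that the increment is applied there, and that the controller pushes one update per participant once the count reaches the number of participants.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PROCS 64  /* hypothetical upper bound on participants */

/* Toy model of a barrier variable owned by its home memory controller. */
typedef struct {
    unsigned count;            /* arrivals seen in the current episode */
    unsigned nprocs;           /* participants; also the trigger value */
    unsigned updates_sent;     /* total update messages pushed so far */
    bool released[MAX_PROCS];  /* each processor's local release flag */
} mc_barrier;

void mc_barrier_init(mc_barrier *b, unsigned nprocs) {
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    b->count = 0;
    b->nprocs = nprocs;
    b->updates_sent = 0;
    for (unsigned p = 0; p < MAX_PROCS; p++) {
        b->released[p] = false;
    }
}

/* Processor `pid` arrives at the barrier. The increment is modelled as
 * running at the controller, so the processor never reads the counter.
 * Returns true if this arrival triggered the release. */
bool mc_barrier_arrive(mc_barrier *b, unsigned pid) {
    assert(pid < b->nprocs);
    b->released[pid] = false;  /* clear the local flag for this episode */
    b->count++;                /* "atomic" update at the home controller */
    if (b->count < b->nprocs) {
        return false;          /* trigger value not reached yet */
    }
    /* Trigger value reached: push one update to every participant. */
    for (unsigned p = 0; p < b->nprocs; p++) {
        b->released[p] = true;
        b->updates_sent++;
    }
    b->count = 0;              /* reset so the barrier can be reused */
    return true;
}

/* A waiting processor polls only its own local flag. */
bool mc_barrier_is_released(const mc_barrier *b, unsigned pid) {
    assert(pid < b->nprocs);
    return b->released[pid];
}
```

In this model a barrier episode with N participants costs N arrival requests and N pushed updates, and no waiting processor polls the shared counter. That is the intuition behind moving the atomic operation to the memory controller, although the model says nothing about the actual latencies measured in the paper.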