Low-Overhead, High-Speed Multi-core Barrier Synchronization

Authors:
John Sartori; Rakesh Kumar
Affiliations:
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Venue:
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
Year:
2010

Abstract

Whereas efficient barrier implementations were once a concern only in high-performance computing, recent trends in core integration make the topic relevant even for general-purpose chip multiprocessors (CMPs). While the nature of CMP applications requires low latency, the cost of low-latency barrier implementations using hardware-based techniques can be prohibitive for CMPs, where die area could otherwise be spent on improving throughput and yield. Similarly, whereas traditional multiprocessor barrier implementations were developed primarily for dedicated environments, scheduling and multi-programming on CMPs require more adaptable barrier implementations. In this paper, we present and evaluate three barrier implementations that are hybrids of software and dedicated hardware barriers and are specifically tailored for CMPs. The implementations leverage the unique characteristics of CMPs and provide low latency comparable to that of dedicated hardware networks at a fraction of the cost. The implementations also support adaptability, enabling efficient multi-programming and dynamic remapping of the barrier network.