TLSync: support for multiple fast barriers using on-chip transmission lines

Authors:
Jungju Oh;Milos Prvulovic;Alenka Zajic
Affiliations:
Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA;Georgia Institute of Technology, Atlanta, GA, USA
Venue:
Proceedings of the 38th annual international symposium on Computer architecture
Year:
2011

Abstract

As the number of cores on a single chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in both a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple barriers (e.g., when concurrently executing multiple parallel applications). This paper presents TLSync, a novel hardware barrier implementation that uses the high-frequency part of the spectrum in a transmission-line broadcast network, thus leaving the transmission-line network free for non-modulated (baseband) data transmission. In contrast to other hardware barrier implementations, TLSync allows each of multiple thread groups to have its own barrier. This is accomplished by allocating different bands in the radio-frequency spectrum to different groups. Our circuit-level and electromagnetic models show that the worst-case latency of a TLSync barrier is 4 ns to 10 ns, depending on the size of the frequency band allocated to each group, and our cycle-accurate architectural simulations show that low-latency TLSync barriers provide significant performance and scalability benefits to barrier-intensive applications.
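
To illustrate the claim that a software tournament barrier needs more rounds as the core count grows, the following is a minimal sketch, not taken from the paper. It assumes a standard pairwise tournament in which half of the remaining threads advance each round, so the arrival phase takes ceil(log2 N) rounds for N threads. The function names `tournament_rounds` and `simulate_arrival` are hypothetical, and the sketch models only the round count, not per-round latency.

```python
def tournament_rounds(num_cores: int) -> int:
    """Closed-form round count for the arrival phase: ceil(log2(num_cores)).

    Assumption: a pairwise tournament where one thread of each pair advances.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using exact
    # integer arithmetic rather than floating-point logarithms.
    return (num_cores - 1).bit_length()


def simulate_arrival(num_cores: int) -> int:
    """Count rounds by repeatedly pairing the threads still in the tournament."""
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    survivors = num_cores
    rounds = 0
    while survivors > 1:
        # Winners advance; with an odd count, the unpaired thread gets a bye.
        survivors = (survivors + 1) // 2
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (2, 16, 64, 256, 1024):
        print(f"{n:5d} cores -> {tournament_rounds(n)} arrival rounds")
```

Under these assumptions the round count grows only logarithmically, but each round involves communication between cores, so it compounds with the per-round latency increase that the abstract also mentions.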