VLSI assist for a multiprocessor
ASPLOS II Proceedings of the second international conference on Architectual support for programming languages and operating systems
Fast barrier synchronization hardware
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The network architecture of the Connection Machine CM-5 (extended abstract)
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
An effective synchronization network for hot-spot accesses
ACM Transactions on Computer Systems (TOCS)
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
IEEE Transactions on Parallel and Distributed Systems
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Proceedings of the 25th annual international symposium on Computer architecture
The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine (Extended Abstract)
ISCA '82 Proceedings of the 9th annual symposium on Computer Architecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
A CMOS RF RMS Detector for Built-in Testing of Wireless Transceivers
VTS '05 Proceedings of the 23rd IEEE Symposium on VLSI Test
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
NoC with Near-Ideal Express Virtual Channels Using Global-Line Communication
HOTI '08 Proceedings of the 2008 16th IEEE Symposium on High Performance Interconnects
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Power reduction of CMP communication networks via RF-interconnects
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Scalability Evaluation of Barrier Algorithms for OpenMP
IWOMP '09 Proceedings of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism
Design and implementation of message-passing services for the Blue Gene/L supercomputer
IBM Journal of Research and Development
Efficient and scalable barrier synchronization for many-core CMPs
Proceedings of the 7th ACM international conference on Computing frontiers
Body effect up- and down-conversion mixer circuits for low-voltage ultra-wideband operation
Analog Integrated Circuits and Signal Processing
A case for globally shared-medium on-chip interconnect
Proceedings of the 38th annual international symposium on Computer architecture
Low-Overhead, high-speed multi-core barrier synchronization
HiPEAC'10 Proceedings of the 5th international conference on High Performance Embedded Architectures and Compilers
A case for globally shared-medium on-chip interconnect
Proceedings of the 38th annual international symposium on Computer architecture
Enhancing effective throughput for transmission line-based bus
Proceedings of the 39th Annual International Symposium on Computer Architecture
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Hi-index | 0.00 |
As the number of cores on a single-chip grows, scalable barrier synchronization becomes increasingly difficult to implement. In software implementations, such as the tournament barrier, a larger number of cores results in a longer latency for each round and a larger number of rounds. Hardware barrier implementations require significant dedicated wiring, e.g., using a reduction (arrival) tree and a notification (release) tree, and multiple instances of this wiring are needed to support multiple barriers (e.g., when concurrently executing multiple parallel applications). This paper presents TLSync, a novel hardware barrier implementation that uses the high-frequency part of the spectrum in a transmission-line broadcast network, thus leaving the transmission line network free for non-modulated (baseband) data transmission. In contrast to other implementations of hardware barriers, TLSync allows multiple thread groups to each have its own barrier. This is accomplished by allocating different bands in the radio-frequency spectrum to different groups. Our circuit-level and electromagnetic models show that the worst-case latency for a TLSync barrier is 4ns to 10ns, depending on the size of the frequency band allocated to each group, and our cycle-accurate architectural simulations show that low-latency TLSync barriers provide significant performance and scalability benefits to barrier-intensive applications.