The fuzzy barrier: a mechanism for high speed synchronization of processors

Authors:
Rajiv Gupta
Affiliations:
Philips Laboratories, North American Philips Corporation, 345 Scarborough Road, Briarcliff Manor, NY
Venue:
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Year:
1989

Abstract

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors have reached it. A software implementation of the barrier mechanism using shared variables has two major drawbacks. First, executing the barrier may be slow, since it may not only require the execution of several instructions but also result in hot-spot accesses. Second, processors that are stalled waiting for other processors to reach the barrier are essentially idle and cannot do any useful work. This paper presents the notion of the fuzzy barrier, which avoids both drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second is solved by extending the barrier concept to include a region of statements that a processor can execute while it awaits synchronization. These barrier regions are constructed by a compiler and consist of several instructions, such that a processor is ready to synchronize upon reaching the first instruction in the region and must synchronize before exiting it. When synchronization occurs, the processors may be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and that program transformations can significantly increase their size. Examples of situations in which such a mechanism can improve performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the mechanism can greatly reduce synchronization overhead.
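
The barrier-region idea described in the abstract can be illustrated with a small software sketch. This is not the hardware mechanism the paper proposes, nor its Encore implementation; it is a minimal split-phase barrier built on Python's `threading.Condition`, written under the assumption that "ready to synchronize" and "must synchronize before exiting" can be modeled as two separate calls. The names `FuzzyBarrier`, `enter_region`, `exit_region`, and `run_demo` are hypothetical and chosen only for this illustration.

```python
import threading


class FuzzyBarrier:
    """Illustrative split-phase barrier (hypothetical names, software only).

    enter_region() marks a thread as ready to synchronize and never blocks.
    exit_region() blocks only if some participant has not yet entered its
    own barrier region for the same phase.
    """

    def __init__(self, parties):
        self._parties = parties
        self._arrived = 0
        self._phase = 0
        self._cond = threading.Condition()

    def enter_region(self):
        """Signal readiness; return a token identifying the current phase."""
        with self._cond:
            token = self._phase
            self._arrived += 1
            if self._arrived == self._parties:
                # Last arrival completes the phase and releases any waiters.
                self._arrived = 0
                self._phase += 1
                self._cond.notify_all()
            return token

    def exit_region(self, token):
        """Leave the region; return True if this thread had to stall."""
        with self._cond:
            stalled = self._phase == token
            while self._phase == token:
                self._cond.wait()
            return stalled


def run_demo(n_threads=4, n_steps=3):
    """Each thread publishes a value, does independent work inside the
    barrier region, then reads its neighbour's value after the region."""
    barrier = FuzzyBarrier(n_threads)
    shared = [[None] * n_threads for _ in range(n_steps)]
    seen = [[None] * n_steps for _ in range(n_threads)]

    def worker(tid):
        for step in range(n_steps):
            shared[step][tid] = tid + step       # must precede the barrier
            token = barrier.enter_region()       # ready to synchronize
            sum(range(1000 * (tid + 1)))         # region: independent work only
            barrier.exit_region(token)           # synchronized before leaving
            neighbour = (tid + 1) % n_threads
            seen[tid][step] = shared[step][neighbour]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

In this sketch, the work placed between `enter_region` and `exit_region` plays the role of the barrier region: it must not depend on values other threads produce in the same step. The longer that independent work runs, the more likely `exit_region` returns without waiting, which mirrors the abstract's observation that larger regions make stalls less likely. A conventional barrier corresponds to calling `exit_region` immediately after `enter_region`.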