Bulldog: a compiler for VLSI architectures
Bulldog: a compiler for VLSI architectures
Compilers: principles, techniques, and tools
Compilers: principles, techniques, and tools
High-performance computer architecture
High-performance computer architecture
Distributing Hot-Spot Addressing in Large-Scale Multiprocessors
IEEE Transactions on Computers
Guided self-scheduling: A practical scheduling scheme for parallel supercomputers
IEEE Transactions on Computers
Compiler Optimizations for Enhancing Parallelism and Their Impact on Architecture Design
IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
Guide to parallel programming on Sequent computer systems: 2nd edition
Guide to parallel programming on Sequent computer systems: 2nd edition
Reduced instruction set computers
Communications of the ACM - Special section on computer architecture
Postpass Code Optimization of Pipeline Constraints
ACM Transactions on Programming Languages and Systems (TOPLAS)
Dependence graphs and compiler optimizations
POPL '81 Proceedings of the 8th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Register allocation and code scheduling for load/store architectures
Register allocation and code scheduling for load/store architectures
A reconfigurable liw architecture and its compiler
A reconfigurable liw architecture and its compiler
Quartz: a tool for tuning parallel program performance
SIGMETRICS '90 Proceedings of the 1990 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Employing register channels for the exploitation of instruction level parallelism
PPOPP '90 Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming
Fast barrier synchronization hardware
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Loop displacement: an approach for transforming and scheduling loops for parallel execution
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The design of a RISC based multiprocessor chip
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
Executing loops on a fine-grained MIMD architecture
MICRO 24 Proceedings of the 24th annual international symposium on Microarchitecture
Subset barrier synchronization on a private-memory parallel system
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
The network architecture of the Connection Machine CM-5 (extended abstract)
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
IEEE Transactions on Parallel and Distributed Systems
Global Virtual Time and distributed synchronization
PADS '95 Proceedings of the ninth workshop on Parallel and distributed simulation
Parallel algorithms for the circuit value update problem
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Efficient techniques for fast nested barrier synchronization
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
An evaluation of memory consistency models for shared-memory systems with ILP processors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A run time support system for multiprocessor machines
ICS '90 Proceedings of the 4th international conference on Supercomputing
A fine-grained MIMD architecture based upon register channels
MICRO 23 Proceedings of the 23rd annual workshop and symposium on Microprogramming and microarchitecture
Four-Ary Tree-Based Barrier Synchronization for 2D Meshes without Nonmember Involvement
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Speculative synchronization: applying thread-level speculation to explicitly parallel applications
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Two-Phase Barrier: A Synchronization Primitive for Improving the Processor Utilization
International Journal of Parallel Programming
Performance Benefits of NIC-Based Barrier on Myrinet/GM
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Fast NIC-Based Barrier over Myrinet/GM
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
Integrated Network Barriers for D-Dimensional Meshes
PACT '93 Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism
A quasi-barrier technique to improve performance of an irregular application
FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Fast barrier synchronization in wormhole k-ary n-cube networks with multidestination worms
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
A Performance Debugger for Eliminating Excess Synchronization in Shared-Memory Parallel Programs
MASCOTS '96 Proceedings of the 4th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems
Using early phase termination to eliminate load imbalances at barrier synchronization points
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
Performance of memory reclamation for lockless synchronization
Journal of Parallel and Distributed Computing
Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
Proceedings of the 22nd annual international conference on Supercomputing
Dynamic recognition of synchronization operations for improved data race detection
ISSTA '08 Proceedings of the 2008 international symposium on Software testing and analysis
Chunking parallel loops in the presence of synchronization
Proceedings of the 23rd international conference on Supercomputing
ECMon: exposing cache events for monitoring
Proceedings of the 36th annual international symposium on Computer architecture
Making lockless synchronization fast: performance implications of memory reclamation
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
ISPDC'03 Proceedings of the Second international conference on Parallel and distributed computing
Comparing the usability of library vs. language approaches to task parallelism
Evaluation and Usability of Programming Languages and Tools
Hiding latency in Coarray Fortran 2.0
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
Unifying barrier and point-to-point synchronization in OpenMP with phasers
IWOMP'11 Proceedings of the 7th international conference on OpenMP in the Petascale era
Habanero-Java extensions for scientific computing
Proceedings of the 9th Workshop on Parallel/High-Performance Object-Oriented Scientific Computing
Habanero-Java: the new adventures of old X10
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
DrHJ: a lightweight pedagogic IDE for Habanero Java
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Intermediate language extensions for parallelism
Proceedings of the compilation of the co-located workshops on DSM'11, TMC'11, AGERE!'11, AOOPES'11, NEAT'11, & VMIL'11
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
Speculative optimizations for parallel programs on multicores
LCPC'09 Proceedings of the 22nd international conference on Languages and Compilers for Parallel Computing
A Transformation Framework for Optimizing Task-Parallel Programs
ACM Transactions on Programming Languages and Systems (TOPLAS)
Interference resilient PDES on multi-core systems: towards proportional slowdown
Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
On Automation in the Verification of Software Barriers: Experience Report
Journal of Automated Reasoning
Parallel Algorithms for the Circuit Value Update Problem
Theory of Computing Systems
Hi-index | 0.00 |
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using shared variables has two major drawbacks. Firstly, the execution of the barrier may be slow as it may not only require execution of several instructions and but also result in hot-spot accesses. Secondly, processors that are stalled waiting for other processors to reach the barrier are essentially idling and cannot do any useful work. In this paper, the notion of the fuzzy barrier is presented, that avoids the above drawbacks. The first problem is avoided by implementing the mechanism in hardware. The second problem is solved by extending the barrier concept to include a region of statements that can be executed by a processor while it awaits synchronization. The barrier regions are constructed by a compiler and consist of several instructions such that a processor is ready to synchronize upon reaching the first instruction in this region and must synchronize before exiting the region. When synchronization does occur, the processors could be executing at any point in their respective barrier regions. The larger the barrier region, the more likely it is that none of the processors will have to stall. Preliminary investigations show that barrier regions can be large and the use of program transformations can significantly increase their size. Examples of situations where such a mechanism can result in improved performance are presented. Results based on a software implementation of the fuzzy barrier on the Encore multiprocessor indicate that the synchronization overhead can be greatly reduced using the mechanism.