Efficient synchronization primitives for large-scale cache-coherent multiprocessors

Authors:
James R. Goodman;Mary K. Vernon;Philip J. Woest
Affiliations:
Univ. of Wisconsin-Madison, Madison;Univ. of Wisconsin-Madison, Madison;Univ. of Wisconsin-Madison, Madison
Venue:
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Year:
1989

Citing 12
Cited 93

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Multiprocessor cache synchronization: issues, innovations, evolution

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
A Scheme to Enforce Data Dependence on Large Multiprocessor Systems

IEEE Transactions on Software Engineering
The butterfly barrier

International Journal of Parallel Programming
Distributing Hot-Spot Addressing in Large-Scale Multiprocessors

IEEE Transactions on Computers
Applications considerations in the system design of highly concurrent multiprocessors

IEEE Transactions on Computers
The Wisconsin multicube: a new large-scale cache-coherent multiprocessor

ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
A mean-value performance analysis of a new multiprocessor architecture

SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Guide to parallel programming on Sequent computer systems: 2nd edition

Guide to parallel programming on Sequent computer systems: 2nd edition
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors

ACM Transactions on Programming Languages and Systems (TOPLAS)
Performance measurements on HEP - a pipelined MIMD computer

ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Dynamic decentralized cache schemes for mimd parallel processors

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture

Using feedback to control tree saturation in multistage interconnection networks

ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Cache considerations for multiprocessor programmers

Communications of the ACM
Synchronization Algorithms for Shared-Memory Multiprocessors

Computer
Scalable coherent interface

Computer
Stanford distributed-directory protocol

Computer
Counting networks and multi-processor coordination

STOC '91 Proceedings of the twenty-third annual ACM symposium on Theory of computing
Algorithms for scalable synchronization on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
Process coordination with fetch-and-increment

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Synchronization without contention

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Implementation and performance of Munin

SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor

Computer
Low contention load balancing on large-scale multiprocessors

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Cache Invalidation Patterns in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Hardware combining and scalability

SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
A performance evaluation of optimal hybrid cache coherency protocols

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cooperative shared memory: software and hardware for scalable multiprocessor

ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons

ACM Computing Surveys (CSUR)
Cooperative shared memory: software and hardware for scalable multiprocessors

ACM Transactions on Computer Systems (TOCS)
Fast, scalable synchronization with minimal hardware support

PODC '93 Proceedings of the twelfth annual ACM symposium on Principles of distributed computing
Adaptive cache coherency for detecting migratory shared data

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Mechanisms for cooperative shared memory

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Transactional memory: architectural support for lock-free data structures

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Compiling for shared-memory and message-passing computers

ACM Letters on Programming Languages and Systems (LOPLAS)
Diffracting trees (preliminary version)

SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Counting networks

Journal of the ACM (JACM)
Request Combining in Multiprocessors with Arbitrary Interconnection Networks

IEEE Transactions on Parallel and Distributed Systems
Reactive synchronization algorithms for multiprocessors

ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters

IEEE Transactions on Parallel and Distributed Systems
Scalable concurrent counting

ACM Transactions on Computer Systems (TOCS)
Elimination trees and the construction of pools and stacks: preliminary version

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
The communication requirements of mutual exclusion

Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Diffracting trees

ACM Transactions on Computer Systems (TOCS)
An evaluation of memory consistency models for shared-memory systems with ILP processors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A steady state analysis of diffracting trees (extended abstract)

Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
The GLOW cache coherence protocol extensions for widely shared data

ICS '96 Proceedings of the 10th international conference on Supercomputing
Data Forwarding in Scalable Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
An efficient caching support for critical sections in large-scale shared-memory multiprocessors

ICS '90 Proceedings of the 4th international conference on Supercomputing
Reactive diffracting trees

Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
An inherent bottleneck in distributed counting

PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
Synchronization transformations for parallel computing

Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Hardware fault containment in scalable shared-memory multiprocessors

Proceedings of the 24th annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB

Proceedings of the 24th annual international symposium on Computer architecture
Contention in shared memory algorithms

Journal of the ACM (JACM)
Combining funnels: a new twist on an old tale…

PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors

ICS '98 Proceedings of the 12th international conference on Supercomputing
Weak ordering—a new definition

25 years of the international symposia on Computer architecture (selected papers)
Scalable concurrent priority queue algorithms

Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing
Restricted Fetch and Φ operations for parallel processing

ICS '89 Proceedings of the 3rd international conference on Supercomputing
Weak ordering—a new definition

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
PLUS: a distributed shared-memory system

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Adaptive software cache management for distributed shared memory architectures

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
System-on-a-chip processor synchronization support in hardware

Proceedings of the conference on Design, automation and test in Europe
Adding networks

Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Transactional lock-free execution of lock-based programs

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
The Use of Feedback in Multiprocessors and Its Application to Tree Saturation Control

IEEE Transactions on Parallel and Distributed Systems
Design Considerations for Shared Memory Multiprocessor Message Systems

IEEE Transactions on Parallel and Distributed Systems
Performance of Pruning-Cache Directories for Large-Scale Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
A Circular List-Based Mutual Exclusion Scheme for Large Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Adding Networks

DISC '01 Proceedings of the 15th International Conference on Distributed Computing
Efficient synchronization for nonuniform communication architectures

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Inferential queueing and speculative push for reducing critical communication latencies

ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Hierarchical Backoff Locks for Nonuniform Communication Architectures

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications

IEEE Transactions on Parallel and Distributed Systems
A scalable lock-free stack algorithm

Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
The counting pyramid: an adaptive distributed counting scheme

Journal of Parallel and Distributed Computing
Read-modify-write networks

Distributed Computing
Linearizable counting networks

Distributed Computing
Using elimination to implement scalable and lock-free FIFO queues

Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Store Memory-Level Parallelism Optimizations for Commercial Applications

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Inferential queueing and speculative push

International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
A case study of multi-threading in the embedded space

CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Lightweight lock-free synchronization methods for multithreading

Proceedings of the 20th annual international conference on Supercomputing
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
The cost of concurrent, low-contention Read&Modify&Write

Theoretical Computer Science - Foundations of software science and computation structures
The WaveScalar architecture

ACM Transactions on Computer Systems (TOCS)
Self-tuning reactive diffracting trees

Journal of Parallel and Distributed Computing
The mechanics of in-kernel synchronization for a scalable microkernel

ACM SIGOPS Operating Systems Review
SNZI: scalable NonZero indicators

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Techniques for efficient placement of synchronization primitives

Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A scalable lock-free stack algorithm

Journal of Parallel and Distributed Computing
Fast barrier synchronization for InfiniBand™

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Architectural Support for Fair Reader-Writer Locking

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
An adaptive technique for constructing robust and high-throughput shared objects

OPODIS'10 Proceedings of the 14th international conference on Principles of distributed systems
CAFÉ: scalable task pools with adjustable fairness and contention

DISC'11 Proceedings of the 25th international conference on Distributed computing
Speeding-up synchronizations in DSM multiprocessors

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Constructing shared objects that are both robust and high-throughput

DISC'06 Proceedings of the 20th international conference on Distributed Computing
Revisiting the combining synchronization technique

Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Support for fine-grained synchronization in shared-memory multiprocessors

PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Efficient fetch-and-increment

DISC'12 Proceedings of the 26th international conference on Distributed Computing
Location-aware cache management for many-core processors with deep cache hierarchy

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Elimination Trees and the Construction of Pools and Stacks

Theory of Computing Systems

Quantified Score

Hi-index	0.02

Abstract

This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the primitives are that hardware combining is not implemented in the interconnect and, in one case, that the interconnect supports broadcast.

The primitives make use of synchronization bits (syncbits) to provide a simple mechanism for mutual exclusion. The proposed implementation of the primitives includes efficient (i.e., local) busy-waiting for syncbits. In addition, a hardware-supported mechanism for maintaining a first-come, first-served queue of requests for a syncbit is proposed. This queueing mechanism allows for a very efficient implementation of, as well as fair access to, binary semaphores. We also propose to implement Fetch-and-Add with combining in software rather than hardware. This allows an architecture to scale to a large number of processors while avoiding the cost of hardware combining.

Scenarios for common synchronization events such as work queues and barriers are presented to demonstrate the generality and ease of use of the proposed primitives. The efficient implementation of the primitives is simpler if the multiprocessor has a hardware cache-consistency protocol. To illustrate this point, we outline how the primitives would be implemented in the Multicube multiprocessor [GoWo88].
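To make the idea of Fetch-and-Add with combining in software more concrete, the following is a minimal sketch, not the paper's implementation. It is a sequential simulation: a batch of pending requests is merged pairwise up a binary tree, the shared counter is updated once with the combined total, and the fetched values are handed back down the tree. The names `Request`, `Node`, and `combined_fetch_and_add` are hypothetical, and a real multiprocessor version would need per-node synchronization (for example, the syncbits described above), which this sketch omits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    """One processor's pending Fetch-and-Add request (hypothetical structure)."""
    proc_id: int
    increment: int


@dataclass
class Node:
    """Node of the combining tree: a leaf holds a request, an inner node a sum."""
    total: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    request: Optional[Request] = None


def combined_fetch_and_add(counter: List[int], requests: List[Request]) -> Dict[int, int]:
    """Serve a batch of requests with a single update to the shared counter.

    `counter` is a one-element list standing in for the shared memory word.
    Returns a mapping from proc_id to the value that processor fetched.
    """
    if not requests:
        return {}

    # Combine phase: merge neighbouring requests level by level.
    nodes = [Node(total=r.increment, request=r) for r in requests]
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes) - 1, 2):
            left, right = nodes[i], nodes[i + 1]
            merged.append(Node(total=left.total + right.total, left=left, right=right))
        if len(nodes) % 2 == 1:
            merged.append(nodes[-1])  # odd node moves up uncombined
        nodes = merged
    root = nodes[0]

    # Single access to the shared word on behalf of the whole batch.
    base = counter[0]
    counter[0] = base + root.total

    # Distribute phase: the left subtree sees the incoming value,
    # the right subtree sees it offset by the left subtree's total.
    results: Dict[int, int] = {}
    stack = [(root, base)]
    while stack:
        node, start = stack.pop()
        if node.request is not None:
            results[node.request.proc_id] = start
        else:
            stack.append((node.left, start))
            stack.append((node.right, start + node.left.total))
    return results


if __name__ == "__main__":
    shared = [10]
    batch = [Request(proc_id=i, increment=i + 1) for i in range(5)]
    print(combined_fetch_and_add(shared, batch), shared[0])
```

Each processor receives the value it would have fetched had the requests been served one after another in batch order, yet the shared word is touched only once per batch. That reduction in traffic to a single hot location is the effect combining is meant to achieve, whether done in hardware or, as the abstract proposes, in software.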