Cache coherence protocols: evaluation using a multiprocessor simulation model
ACM Transactions on Computer Systems (TOCS)
Multiprocessor cache synchronization: issues, innovations, evolution
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
A Scheme to Enforce Data Dependence on Large Multiprocessor Systems
IEEE Transactions on Software Engineering
International Journal of Parallel Programming
Distributing Hot-Spot Addressing in Large-Scale Multiprocessors
IEEE Transactions on Computers
Applications considerations in the system design of highly concurrent multiprocessors
IEEE Transactions on Computers
The Wisconsin multicube: a new large-scale cache-coherent multiprocessor
ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
A mean-value performance analysis of a new multiprocessor architecture
SIGMETRICS '88 Proceedings of the 1988 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Guide to parallel programming on Sequent computer systems: 2nd edition
Guide to parallel programming on Sequent computer systems: 2nd edition
ACM Transactions on Programming Languages and Systems (TOPLAS)
Performance measurements on HEP - a pipelined MIMD computer
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Dynamic decentralized cache schemes for mimd parallel processors
ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Using feedback to control tree saturation in multistage interconnection networks
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Cache considerations for multiprocessor programmers
Communications of the ACM
Computer
Counting networks and multi-processor coordination
STOC '91 Proceedings of the twenty-third annual ACM symposium on Theory of computing
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Process coordination with fetch-and-increment
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Synchronization without contention
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Implementation and performance of Munin
SOSP '91 Proceedings of the thirteenth ACM symposium on Operating systems principles
The Stanford Dash Multiprocessor
Computer
Low contention load balancing on large-scale multiprocessors
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
Cache Invalidation Patterns in Shared-Memory Multiprocessors
IEEE Transactions on Computers
Hardware combining and scalability
SPAA '92 Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures
A performance evaluation of optimal hybrid cache coherency protocols
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cooperative shared memory: software and hardware for scalable multiprocessor
ASPLOS V Proceedings of the fifth international conference on Architectural support for programming languages and operating systems
Cache coherence in large-scale shared-memory multiprocessors: issues and comparisons
ACM Computing Surveys (CSUR)
Cooperative shared memory: software and hardware for scalable multiprocessors
ACM Transactions on Computer Systems (TOCS)
Fast, scalable synchronization with minimal hardware support
PODC '93 Proceedings of the twelfth annual ACM symposium on Principles of distributed computing
Adaptive cache coherency for detecting migratory shared data
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Mechanisms for cooperative shared memory
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Transactional memory: architectural support for lock-free data structures
ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Compiling for shared-memory and message-passing computers
ACM Letters on Programming Languages and Systems (LOPLAS)
Diffracting trees (preliminary version)
SPAA '94 Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures
Journal of the ACM (JACM)
Request Combining in Multiprocessors with Arbitrary Interconnection Networks
IEEE Transactions on Parallel and Distributed Systems
Reactive synchronization algorithms for multiprocessors
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Distributed Hardwired Barrier Synchronization for Scalable Multiprocessor Clusters
IEEE Transactions on Parallel and Distributed Systems
ACM Transactions on Computer Systems (TOCS)
Elimination trees and the construction of pools and stacks: preliminary version
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
The communication requirements of mutual exclusion
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
ACM Transactions on Computer Systems (TOCS)
An evaluation of memory consistency models for shared-memory systems with ILP processors
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
A steady state analysis of diffracting trees (extended abstract)
Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures
The GLOW cache coherence protocol extensions for widely shared data
ICS '96 Proceedings of the 10th international conference on Supercomputing
Data Forwarding in Scalable Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
An efficient caching support for critical sections in large-scale shared-memory multiprocessors
ICS '90 Proceedings of the 4th international conference on Supercomputing
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
An inherent bottleneck in distributed counting
PODC '97 Proceedings of the sixteenth annual ACM symposium on Principles of distributed computing
Synchronization transformations for parallel computing
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Hardware fault containment in scalable shared-memory multiprocessors
Proceedings of the 24th annual international symposium on Computer architecture
Efficient synchronization: let them eat QOLB
Proceedings of the 24th annual international symposium on Computer architecture
Contention in shared memory algorithms
Journal of the ACM (JACM)
Combining funnels: a new twist on an old tale…
PODC '98 Proceedings of the seventeenth annual ACM symposium on Principles of distributed computing
A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors
ICS '98 Proceedings of the 12th international conference on Supercomputing
Weak ordering—a new definition
25 years of the international symposia on Computer architecture (selected papers)
Scalable concurrent priority queue algorithms
Proceedings of the eighteenth annual ACM symposium on Principles of distributed computing
Restricted Fetch and Φ operations for parallel processing
ICS '89 Proceedings of the 3rd international conference on Supercomputing
Weak ordering—a new definition
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
PLUS: a distributed shared-memory system
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Adaptive software cache management for distributed shared memory architectures
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
System-on-a-chip processor synchronization support in hardware
Proceedings of the conference on Design, automation and test in Europe
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Transactional lock-free execution of lock-based programs
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
The Use of Feedback in Multiprocessors and Its Application to Tree Saturation Control
IEEE Transactions on Parallel and Distributed Systems
Design Considerations for Shared Memory Multiprocessor Message Systems
IEEE Transactions on Parallel and Distributed Systems
Performance of Pruning-Cache Directories for Large-Scale Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
A Circular List-Based Mutual Exclusion Scheme for Large Shared-Memory Multiprocessors
IEEE Transactions on Parallel and Distributed Systems
DISC '01 Proceedings of the 15th International Conference on Distributed Computing
Efficient synchronization for nonuniform communication architectures
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Inferential queueing and speculative push for reducing critical communication latencies
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Thread prioritization: a thread scheduling mechanism for multiple-context parallel processors
HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
Hierarchical Backoff Locks for Nonuniform Communication Architectures
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
The Impact of Negative Acknowledgments in Shared Memory Scientific Applications
IEEE Transactions on Parallel and Distributed Systems
A scalable lock-free stack algorithm
Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures
The counting pyramid: an adaptive distributed counting scheme
Journal of Parallel and Distributed Computing
Distributed Computing
Linearizable counting networks
Distributed Computing
Using elimination to implement scalable and lock-free FIFO queues
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Inferential queueing and speculative push
International Journal of Parallel Programming - Special issue I: The 17th annual international conference on supercomputing (ICS'03)
A case study of multi-threading in the embedded space
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Lightweight lock-free synchronization methods for multithreading
Proceedings of the 20th annual international conference on Supercomputing
Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
The cost of concurrent, low-contention Read&Modify&Write
Theoretical Computer Science - Foundations of software science and computation structures
ACM Transactions on Computer Systems (TOCS)
Self-tuning reactive diffracting trees
Journal of Parallel and Distributed Computing
The mechanics of in-kernel synchronization for a scalable microkernel
ACM SIGOPS Operating Systems Review
SNZI: scalable NonZero indicators
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Techniques for efficient placement of synchronization primitives
Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
A scalable lock-free stack algorithm
Journal of Parallel and Distributed Computing
Fast barrier synchronization for InfiniBand™
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Architectural Support for Fair Reader-Writer Locking
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
An adaptive technique for constructing robust and high-throughput shared objects
OPODIS'10 Proceedings of the 14th international conference on Principles of distributed systems
CAFÉ: scalable task pools with adjustable fairness and contention
DISC'11 Proceedings of the 25th international conference on Distributed computing
Speeding-up synchronizations in DSM multiprocessors
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Constructing shared objects that are both robust and high-throughput
DISC'06 Proceedings of the 20th international conference on Distributed Computing
Revisiting the combining synchronization technique
Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming
Support for fine-grained synchronization in shared-memory multiprocessors
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
DISC'12 Proceedings of the 26th international conference on Distributed Computing
Location-aware cache management for many-core processors with deep cache hierarchy
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Elimination Trees and the Construction of Pools and Stacks
Theory of Computing Systems
This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the interconnect, and (in one case) that the interconnect supports broadcast.

The primitives make use of synchronization bits (syncbits) to provide a simple mechanism for mutual exclusion. The proposed implementation of the primitives includes efficient (i.e., local) busy-waiting for syncbits. In addition, a hardware-supported mechanism for maintaining a first-come, first-served queue of requests for a syncbit is proposed. This queueing mechanism allows for a very efficient implementation of, as well as fair access to, binary semaphores. We also propose to implement Fetch-and-Add with combining in software rather than hardware. This allows an architecture to scale to a large number of processors while avoiding the cost of hardware combining.

Scenarios for common synchronization events such as work queues and barriers are presented to demonstrate the generality and ease of use of the proposed primitives. The efficient implementation of the primitives is simpler if the multiprocessor has a hardware cache-consistency protocol. To illustrate this point, we outline how the primitives would be implemented in the Multicube multiprocessor [GoWo88].
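The first-come, first-served queueing and local busy-waiting ideas described above can be illustrated with a rough software sketch. This is not the hardware syncbit mechanism the paper proposes; it is an array-based queue lock written only to show the two properties the abstract names: requests are served in arrival order, and each waiter polls only its own slot. The names `QueuedBinarySemaphore`, `max_waiters`, and `run_demo` are hypothetical, and a `threading.Lock` stands in for an atomic fetch-and-increment, which is an assumption of this sketch.

```python
import threading
import time


class QueuedBinarySemaphore:
    """Illustrative FCFS binary semaphore; each waiter polls only its own slot.

    Assumes at most `max_waiters` threads use the semaphore at once.
    """

    def __init__(self, max_waiters):
        self._n = max_waiters
        # _grant[i] is True when the holder of a ticket mapped to slot i may enter.
        self._grant = [False] * max_waiters
        self._grant[0] = True  # the first ticket enters immediately
        self._next_ticket = 0
        # Stand-in for an atomic fetch-and-increment (assumption of this sketch).
        self._ticket_guard = threading.Lock()

    def acquire(self):
        with self._ticket_guard:
            ticket = self._next_ticket
            self._next_ticket += 1
        slot = ticket % self._n
        while not self._grant[slot]:  # "local" busy-wait: only this slot is polled
            time.sleep(0)             # yield so other threads can make progress
        self._grant[slot] = False     # consume the grant so the slot can be reused
        return slot

    def release(self, slot):
        # Hand the semaphore to the next request in arrival order.
        self._grant[(slot + 1) % self._n] = True


def run_demo(num_threads=4, iters=100):
    """Increment a shared counter under the semaphore and return its final value."""
    sem = QueuedBinarySemaphore(max_waiters=num_threads)
    counter = [0]

    def worker():
        for _ in range(iters):
            slot = sem.acquire()
            value = counter[0]       # deliberately non-atomic read-modify-write,
            time.sleep(0)            # protected only by the semaphore
            counter[0] = value + 1
            sem.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo(4, 100)` runs four threads through the critical section and returns the final counter, which equals `num_threads * iters` when mutual exclusion holds. Giving each waiter its own slot means a release touches a single location, which is the software analogue of avoiding contention on one shared flag.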