Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
Algorithms for scalable synchronization on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Scalable reader-writer synchronization for shared-memory multiprocessors
PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
The Stanford Dash Multiprocessor
Computer
Memory system design for bus-based multiprocessors
Memory system design for bus-based multiprocessors
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The MIT Alewife machine: architecture and performance
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Synchronization and communication in the T3E multiprocessor
Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
ICS '90 Proceedings of the 4th international conference on Supercomputing
Efficient synchronization: let them eat QOLB
Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Exploiting fine-grain thread level parallelism on the MIT multi-ALU processor
Proceedings of the 25th annual international symposium on Computer architecture
System-on-a-chip processor synchronization support in hardware
Proceedings of the conference on Design, automation and test in Europe
Scalable queue-based spin locks with timeout
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Non-blocking timeout in scalable queue-based spin locks
Proceedings of the twenty-first annual symposium on Principles of distributed computing
Performance measurements on HEP - a pipelined MIMD computer
ISCA '83 Proceedings of the 10th annual international symposium on Computer architecture
Hierarchical Backoff Locks for Nonuniform Communication Architectures
HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Eliminating synchronization-related atomic operations with biased locking and bulk rebiasing
Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications
Concurrent programming without locks
ACM Transactions on Computer Systems (TOCS)
Proceedings of the 34th annual international symposium on Computer architecture
Code Generation and Optimization for Transactional Memory Constructs in an Unmanaged Language
Proceedings of the International Symposium on Code Generation and Optimization
A Fair Fast Scalable Rea,der-Writer Lock
ICPP '93 Proceedings of the 1993 International Conference on Parallel Processing - Volume 02
Privatization techniques for software transactional memory
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Dynamic performance tuning of word-based software transactional memory
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming
The PARSEC benchmark suite: characterization and architectural implications
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Accelerating critical section execution with asymmetric multi-core architectures
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures
The Art of Multiprocessor Programming
The Art of Multiprocessor Programming
TLRW: return of the read-write lock
Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures
Transactional Memory, 2nd Edition
Transactional Memory, 2nd Edition
Preemption adaptivity in time-published queue-based spin locks
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
DISC'06 Proceedings of the 20th international conference on Distributed Computing
Fast RMWs for TSO: semantics and implementation
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the Conference on Design, Automation and Test in Europe
HARS: A hardware-assisted runtime software for embedded many-core architectures
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
Many shared-memory parallel systems use lock-based synchronization mechanisms to provide mutual exclusion or reader-writer access to memory locations. Software locks are inefficient either in memory usage, lock transfer time, or both. Proposed hardware locking mechanisms are either too specific (for example, requiring static assignment of threads to cores and vice-versa), support a limited number of concurrent locks, require tag values to be associated with every memory location, rely on the low latencies of single-chip multicore designs or are slow in adversarial cases such as suspended threads in a lock queue. Additionally, few proposals cover reader-writer locks and their associated fairness issues. In this paper we introduce the Lock Control Unit (LCU) which is an acceleration mechanism collocated with each core to explicitly handle fast reader-writer locking. By associating a unique thread-id to each lock request we decouple the hardware lock from the requestor core. This provides correct and efficient execution in the presence of thread migration. By making the LCU logic autonomous from the core, it seamlessly handles thread preemption. Our design offers richer semantics than previous proposals, such as try lock support while providing direct core-to-core transfers. We evaluate our proposal with micro benchmarks, a fine-grain Software Transactional Memory system and programs from the Parsec and Splash parallel benchmark suites. The lock transfer time decreases in up to 30\% when compared to previous hardware proposals. Transactional Memory systems limited by reader-locking congestion boost up to 3x while still preserving graceful fairness and starvation freedom properties. Finally, commonly used applications achieve speedups up to a 7% when compared to software models.