ACM Transactions on Programming Languages and Systems (TOPLAS)
The SPARC architecture manual: version 8
The SPARC architecture manual: version 8
The SPARC architecture manual (version 9)
The SPARC architecture manual (version 9)
Designing memory consistency models for shared-memory multiprocessors
Designing memory consistency models for shared-memory multiprocessors
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Space/time trade-offs in hash coding with allowable errors
Communications of the ACM
PowerPC Microprocessor Common Hardware Reference Platform: A System Architecture
PowerPC Microprocessor Common Hardware Reference Platform: A System Architecture
A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs)
Journal of Parallel and Distributed Computing
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture
A Better x86 Memory Model: x86-TSO
TPHOLs '09 Proceedings of the 22nd International Conference on Theorem Proving in Higher Order Logics
Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Laws of order: expensive synchronization in concurrent algorithms cannot be eliminated
Proceedings of the 38th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Architectural Support for Fair Reader-Writer Locking
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
A Primer on Memory Consistency and Cache Coherence
A Primer on Memory Consistency and Cache Coherence
Clarifying and compiling C/C++ concurrency: from C++11 to POWER
POPL '12 Proceedings of the 39th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
DISC'06 Proceedings of the 20th international conference on Distributed Computing
Efficient sequential consistency via conflict ordering
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
End-to-end sequential consistency
Proceedings of the 39th Annual International Symposium on Computer Architecture
Hi-index | 0.00 |
Read-Modify-Write (RMW) instructions are widely used as the building blocks of a variety of higher level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC which enforce (variants of) Total-Store-Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Besides, weakening the atomicity definition of TSO RMWs, would also weaken their ordering -- thereby leading to more efficient hardware implementations. In this paper we argue for TSO RMWs to use weaker atomicity definitions -- we consider two weaker definitions: type-2 and type-3, with different relaxed ordering differences. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms -- except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this is equally valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs -- simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.