Efficient and correct execution of parallel programs that share memory
ACM Transactions on Programming Languages and Systems (TOPLAS)
Performance evaluation of memory consistency models for shared-memory multiprocessors
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Simple, fast, and practical non-blocking and blocking concurrent queue algorithms
PODC '96 Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing
Thin locks: featherweight synchronization for Java
PLDI '98 Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation
ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Specifying Concurrent Program Modules
ACM Transactions on Programming Languages and Systems (TOPLAS)
Introduction to algorithms
Hiding Relaxed Memory Consistency with a Compiler
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Lock reservation: Java locks can mostly do without atomic operations
OOPSLA '02 Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
Speculative Sequential Consistency with Little Custom Storage
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Cooperating sequential processes
The origin of concurrent programming
Automatic fence insertion for shared memory multiprocessing
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
Scalable lock-free dynamic memory allocation
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
TO-Lock: Removing Lock Overhead Using the Owners' Temporal Locality
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Dynamic circular work-stealing deque
Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures
Proceedings of the 33rd annual international symposium on Computer Architecture
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th annual international symposium on Computer architecture
CheckFence: checking consistency of concurrent data types on relaxed memory models
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Early experience with a commercial hardware transactional memory implementation
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture
Adaptive Locks: Combining Transactions and Locks for Efficient Concurrency
PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
DRFX: a simple and efficient memory model for concurrent programming languages
PLDI '10 Proceedings of the 2010 ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 37th annual international symposium on Computer architecture
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Efficient sequential consistency using conditional fences
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Efficient processor support for DRFx, a memory model with exceptions
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
A Primer on Memory Consistency and Cache Coherence
A Primer on Memory Consistency and Cache Coherence
Efficient sequential consistency via conflict ordering
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Dynamic synthesis for relaxed memory models
Proceedings of the 33rd ACM SIGPLAN conference on Programming Language Design and Implementation
End-to-end sequential consistency
Proceedings of the 39th Annual International Symposium on Computer Architecture
Fence-free work stealing on bounded TSO processors
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Many modern multicore architectures support shared memory for ease of programming and relaxed memory models to deliver high performance. With relaxed memory models, memory accesses can be reordered dynamically and seen by other processors. Therefore, fence instructions are provided to enforce the memory orderings that are critical to the correctness of a program. However, fence instructions are costly as they cause the processor to stall. Prior works have observed that most of the executions of fence instructions are unnecessary. In this paper we propose address-aware fence, a hardware solution for reducing the overhead of fence instructions without resorting to speculation. Address-aware fence only enforces memory orderings that are necessary to maintain the effect that the traditional fence strives to enforce. This is achieved by dynamically checking a condition for when an execution of a fence must take effect and delay the memory accesses following the fence. When a fence instruction is encountered, first, necessary memory addresses are collected to form a watchlist, and then, only the memory accesses to addresses that are contained in the watchlist are delayed. The memory accesses whose addresses are not contained in the watchlist are allowed to complete without waiting for the completion of pending memory accesses from before the fence. Our experiments conducted on a group of concurrent lock-free algorithms and SPLASH-2 benchmarks show that address-aware fence eliminates nearly all the overhead due to fences and achieves an average improvement of 12.2\% on programs with traditional fences.