Memory access buffering in multiprocessors
ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Implementing Precise Interrupts in Pipelined Processors
IEEE Transactions on Computers
The Stanford Dash Multiprocessor
Computer
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Techniques for reducing overheads of shared-memory multiprocessing
ICS '95 Proceedings of the 9th international conference on Supercomputing
Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures
The interaction of software prefetching with ILP processors in shared-memory systems
Proceedings of the 24th annual international symposium on Computer architecture
Complexity-effective superscalar processors
Proceedings of the 24th annual international symposium on Computer architecture
Reactive NUMA: a design for unifying S-COMA and CC-NUMA
Proceedings of the 24th annual international symposium on Computer architecture
Weak ordering—a new definition
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors
ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
The Effectiveness of SRAM Network Caches in Clustered DSMs
HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Architecture and design of AlphaServer GS320
ACM SIGPLAN Notices
Modeling weakly consistent memories with locks
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Architecture and design of AlphaServer GS320
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Hiding Relaxed Memory Consistency with a Compiler
IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Speculative lock elision: enabling highly concurrent multithreaded execution
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Speculative synchronization: applying thread-level speculation to explicitly parallel applications
Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
High-level modeling and FPGA prototyping of microprocessors
FPGA '03 Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays
Speculative Sequential Consistency with Little Custom Storage
Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
Improved Implementations of the Speculative Memory Access Mechanism specMEM
IWIA '99 Proceedings of the 1999 International Workshop on Innovative Architecture
A first glance at Kilo-instruction based multiprocessors
Proceedings of the 1st conference on Computing frontiers
Improving Hash Join Performance through Prefetching
ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Tolerating Late Memory Traps in Dynamically Scheduled Processors
IEEE Transactions on Computers
TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model
Proceedings of the 31st annual international symposium on Computer architecture
Transactional Memory Coherence and Consistency
Proceedings of the 31st annual international symposium on Computer architecture
A unified theory of shared memory consistency
Journal of the ACM (JACM)
Coherence decoupling: making use of incoherence
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Compiler techniques for high performance sequentially consistent java programs
Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Temporal Streaming of Shared Memory
Proceedings of the 32nd annual international symposium on Computer Architecture
Scalable Load and Store Processing in Latency Tolerant Processors
Proceedings of the 32nd annual international symposium on Computer Architecture
Store-Ordered Streaming of Shared Memory
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Store Memory-Level Parallelism Optimizations for Commercial Applications
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 33rd annual international symposium on Computer Architecture
Proceedings of the 33rd annual international symposium on Computer Architecture
A regulated transitive reduction (RTR) for longer memory race recording
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Unified microprocessor core storage
Proceedings of the 4th international conference on Computing frontiers
Mechanisms for store-wait-free multiprocessors
Proceedings of the 34th annual international symposium on Computer architecture
BulkSC: bulk enforcement of sequential consistency
Proceedings of the 34th annual international symposium on Computer architecture
Rerun: Exploiting Episodes for Lightweight Memory Race Recording
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
DracoSTM: a practical C++ approach to software transactional memory
LCSD '07 Proceedings of the 2007 Symposium on Library-Centric Software Design
Optimistic concurrency for clusters via speculative locking
SYSTOR '09 Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
InvisiFence: performance-transparent memory ordering in conventional multiprocessors
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 36th annual international symposium on Computer architecture
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
ParaLog: enabling and accelerating online parallel monitoring of multithreaded applications
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Proceedings of the 37th annual international symposium on Computer architecture
Journal of Parallel and Distributed Computing
Efficient sequential consistency using conditional fences
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
On the effectiveness of speculative and selective memory fences
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
RCDC: a relaxed consistency deterministic computer
Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems
The impact of memory models on software reliability in multiprocessors
Proceedings of the 30th annual ACM SIGACT-SIGOPS symposium on Principles of distributed computing
Efficient sequential consistency via conflict ordering
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
End-to-end sequential consistency
Proceedings of the 39th Annual International Symposium on Computer Architecture
Implicit transactional memory in kilo-instruction multiprocessors
ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Volition: scalable and precise sequential consistency violation detection
Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems
Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Fast RMWs for TSO: semantics and implementation
Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation
Proceedings of the 27th international ACM conference on International conference on supercomputing
Exploring memory consistency for massively-threaded throughput-oriented processors
Proceedings of the 40th Annual International Symposium on Computer Architecture
WeeFence: toward making fences free in TSO
Proceedings of the 40th Annual International Symposium on Computer Architecture
Heterogeneous-race-free memory models
Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
Hi-index | 0.00 |
Sequential consistency (SC) is the simplest programming interface for shared-memory systems but imposes program order among all memory operations, possibly precluding high performance implementations. Release consistency (RC), however, enables the highest performance implementations but puts the burden on the programmer to specify which memory operations need to be atomic and in program order. This paper shows, for the first time, that SC implementations can perform as well as RC implementations if the hardware provides enough support for speculation. Both SC and RC implementations rely on reordering and overlapping memory operations for high performance. To enforce order when necessary, an RC implementation uses software guarantees, whereas an SC implementation relies on hardware speculation. Our SC implementation, called SC++, closes the performance gap because: (1) the hardware allows not just loads, as some current SC implementations do, but also stores to bypass each other speculatively to hide remote latencies, (2) the hardware provides large speculative state for not just processor, as previously proposed, but also memory to allow out-of-order memory operations, (3) the support for hardware speculation does not add excessive overheads to processor pipeline critical paths, and (4) well-behaved applications incur infrequent rollbacks of speculative execution. Using simulation, we show that SC++ achieves an RC implementation's performance in all the six applications we studied.