Coherence decoupling: making use of incoherence

Authors:
Jaehyuk Huh;Jichuan Chang;Doug Burger;Gurindar S. Sohi
Affiliations:
The University of Texas at Austin;University of Wisconsin-Madison;The University of Texas at Austin;University of Wisconsin-Madison
Venue:
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Year:
2004

Citing 40
Cited 19

HPS, a new microarchitecture: rationale and introduction

MICRO 18 Proceedings of the 18th annual workshop on Microprogramming
Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers

IEEE Transactions on Computers
Proving sequential consistency of high-performance shared memories (extended abstract)

SPAA '91 Proceedings of the third annual ACM symposium on Parallel algorithms and architectures
The Stanford Dash Multiprocessor

Computer
Cooperative shared memory: software and hardware for scalable multiprocessors

ACM Transactions on Computer Systems (TOCS)
The detection and elimination of useless misses in multiprocessors

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Adaptive cache coherency for detecting migratory shared data

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
An adaptive cache coherence protocol optimized for migratory sharing

ISCA '93 Proceedings of the 20th annual international symposium on computer architecture
Combined performance gains of simple cache protocol extensions

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Tempest and typhoon: user-level shared memory

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The Potential of Compile-Time Analysis to Adapt the Cache Coherence Enforcement Strategy to the Data Sharing Characteristics

IEEE Transactions on Parallel and Distributed Systems
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Boosting the performance of hybrid snooping cache protocols

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Multiscalar processors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ARB: A Hardware Mechanism for Dynamic Reordering of Memory References

IEEE Transactions on Computers
An evaluation of memory consistency models for shared-memory systems with ILP processors

Proceedings of the seventh international conference on Architectural support for programming languages and operating systems
Dynamic speculation and synchronization of data dependences

Proceedings of the 24th annual international symposium on Computer architecture
Using prediction to accelerate coherence protocols

Proceedings of the 25th annual international symposium on Computer architecture
Is SC + ILP = RC?

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory sharing predictor: the key to a speculative coherent DSM

ISCA '99 Proceedings of the 26th annual international symposium on Computer architecture
Memory consistency and event ordering in scalable shared-memory multiprocessors

ISCA '90 Proceedings of the 17th annual international symposium on Computer Architecture
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Silent stores for free

Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Speculative lock elision: enabling highly concurrent multithreaded execution

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing

Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
Application-specific protocols for user-level shared memory

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Transactional lock-free execution of lock-based programs

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Speculative synchronization: applying thread-level speculation to explicitly parallel applications

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
Temporally silent stores

Proceedings of the 10th international conference on Architectural support for programming languages and operating systems
The MIPS R10000 Superscalar Microprocessor

IEEE Micro
Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Two Adaptive Hybrid Cache Coherency Protocols

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Distance-Adaptive Update Protocols for Scalable Shared-Memory Multiprocessors

HPCA '96 Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture
Speculative Versioning Cache

HPCA '98 Proceedings of the 4th International Symposium on High-Performance Computer Architecture
Improving CC-NUMA Performance Using Instruction-Based Prediction

HPCA '99 Proceedings of the 5th International Symposium on High Performance Computer Architecture
Variability in Architectural Simulations of Multi-Threaded Workloads

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture

Proceedings of the 30th annual international symposium on Computer architecture

Scalable selective re-execution for EDGE architectures

ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Speculative Incoherent Cache Protocols

IEEE Micro
Temporal Streaming of Shared Memory

Proceedings of the 32nd annual international symposium on Computer Architecture
A NUCA substrate for flexible CMP cache sharing

Proceedings of the 19th annual international conference on Supercomputing
Characterization of TCC on Chip-Multiprocessors

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Store-Ordered Streaming of Shared Memory

Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
ReSlice: Selective Re-Execution of Long-Retired Misspeculated Instructions Using Forward Slicing

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Spatial Memory Streaming

Proceedings of the 33rd annual international symposium on Computer Architecture
Interconnect-Aware Coherence Protocols for Chip Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Speculative supplier identification for reducing power of interconnects in snoopy cache coherence protocols

Proceedings of the 4th international conference on Computing frontiers
Comparison of memory write policies for NoC based multicore cache coherent systems

Proceedings of the conference on Design, automation and test in Europe
Extending CC-NUMA systems to support write update optimizations

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
RETCON: transactional repair without replay

Proceedings of the 37th annual international symposium on Computer architecture
SWEL: hardware cache coherence protocols to map shared data onto shared caches

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

ACM Transactions on Design Automation of Electronic Systems (TODAES)
DeFT: Design space exploration for on-the-fly detection of coherence misses

ACM Transactions on Architecture and Code Optimization (TACO)
Transactional conflict decoupling and value prediction

Proceedings of the international conference on Supercomputing
Complexity-effective multicore coherence

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Edge chasing delayed consistency: pushing the limits of weak memory models

Proceedings of the 2012 ACM workshop on Relaxing synchronization for multicore and manycore scalability

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper explores a new technique called coherence decoupling, which breaks a traditional cache coherence protocol into two protocols: a Speculative Cache Lookup (SCL) protocol and a safe, backing coherence protocol. The SCL protocol produces a speculative load value, typically from an invalid cache line, permitting the processor to compute with incoherent data. In parallel, the coherence protocol obtains the necessary coherence permissions and the correct value. Eventually, the speculative use of the incoherent data can be verified against the coherent data. Thus, coherence decoupling can greatly reduce --- if not eliminate --- the effects of false sharing. Furthermore, coherence decoupling can also reduce latencies incurred by true sharing. SCL protocols reduce those latencies by speculatively writing updates into invalid lines, thereby increasing the accuracy of speculation, without complicating the simple, underlying coherence protocol that guarantees correctness.The performance benefits of coherence decoupling are evaluated using a full-system simulator and a mix of commercial and scientific benchmarks. Our results show that 40% to 90% of all coherence misses can be speculated correctly, and therefore their latencies partially or fully hidden. This capability results in performance improvements ranging from 3% to over 16%, in most cases where the latencies of coherence misses have an effect on performance.