Token coherence: decoupling performance and correctness

Authors:
Milo M. K. Martin; Mark D. Hill; David A. Wood
Affiliations:
University of Wisconsin-Madison; University of Wisconsin-Madison; University of Wisconsin-Madison
Venue:
Proceedings of the 30th annual international symposium on Computer architecture
Year:
2003


Abstract

Many future shared-memory multiprocessor servers will both target commercial workloads and use highly-integrated "glueless" designs. Implementing low-latency cache coherence in these systems is difficult, because traditional approaches either add indirection for common cache-to-cache misses (directory protocols) or require a totally-ordered interconnect (traditional snooping protocols). Unfortunately, totally-ordered interconnects are difficult to implement in glueless designs. An ideal coherence protocol would avoid indirections and interconnect ordering; however, such an approach introduces numerous protocol races that are difficult to resolve.

We propose a new coherence framework to enable such protocols by separating performance from correctness. A performance protocol can optimize for the common case (i.e., absence of races) and rely on the underlying correctness substrate to resolve races, provide safety, and prevent starvation. We call the combination Token Coherence, since it explicitly exchanges and counts tokens to control coherence permissions.

This paper develops TokenB, a specific Token Coherence performance protocol that allows a glueless multiprocessor to both exploit a low-latency unordered interconnect (like directory protocols) and avoid indirection (like snooping protocols). Simulations using commercial workloads show that our new protocol can significantly outperform traditional snooping and directory protocols.
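
The abstract says only that tokens are exchanged and counted to control coherence permissions. As a rough illustration, the sketch below assumes the usual token-counting rule: each block has a fixed number of tokens, a node may read the block while holding at least one token, and may write it only while holding all of them. The names `Node`, `transfer`, `check_invariant`, and the value `TOTAL_TOKENS = 4` are hypothetical and chosen for illustration; the sketch tracks a single block and leaves out data movement, the TokenB broadcast policy, and starvation prevention.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # assumed token count for one block (illustrative value)


@dataclass
class Node:
    """A cache or memory module and the tokens it holds for one block."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # Assumed rule: at least one token grants read permission.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # Assumed rule: all tokens are needed for write permission.
        return self.tokens == TOTAL_TOKENS


def transfer(src: Node, dst: Node, count: int) -> None:
    """Move tokens between nodes; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot send tokens the sender does not hold")
    src.tokens -= count
    dst.tokens += count


def check_invariant(nodes) -> None:
    """Token conservation, and a writer excludes every other reader."""
    assert sum(n.tokens for n in nodes) == TOTAL_TOKENS
    writers = [n for n in nodes if n.can_write()]
    readers = [n for n in nodes if n.can_read()]
    assert not writers or readers == writers


# Memory starts with every token for the block.
memory = Node("memory", TOTAL_TOKENS)
p0, p1 = Node("P0"), Node("P1")
nodes = [memory, p0, p1]

# Two processors obtain read permission: one token each.
transfer(memory, p0, 1)
transfer(memory, p1, 1)
check_invariant(nodes)

# P0 wants to write, so it collects every remaining token.
transfer(memory, p0, memory.tokens)
transfer(p1, p0, p1.tokens)
check_invariant(nodes)
```

Because permissions depend only on how many tokens a node holds, the safety check in `check_invariant` does not depend on the order in which messages arrive. That is the sense in which a performance protocol can be optimized freely while correctness rests on the counting rule.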