Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Authors:
Milo M. K. Martin;Pacia J. Harper;Daniel J. Sorin;Mark D. Hill;David A. Wood
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison;Duke University;University of Wisconsin-Madison;University of Wisconsin-Madison
Venue:
Proceedings of the 30th annual international symposium on Computer architecture
Year:
2003

Abstract

Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destination set (i.e., all processors), reducing latency for cache-to-cache misses at the expense of increased traffic. Directory protocols send requests to the minimal destination set, reducing bandwidth at the expense of an indirection through the directory for cache-to-cache misses. Recently proposed hybrid protocols trade off latency and bandwidth by sending requests directly to a predicted destination set.

This paper explores the destination-set predictor design space, focusing on a collection of important commercial workloads. First, we analyze the sharing behavior of these workloads. Second, we propose predictors that exploit the observed sharing behavior to target different points in the latency/bandwidth tradeoff. Third, we illustrate the effectiveness of destination-set predictors in the context of a multicast snooping protocol. For example, one of our predictors obtains almost 90% of the performance of snooping while using only 15% more bandwidth than a directory protocol (and less than half the bandwidth of snooping).
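
The three ways of choosing a destination set described in the abstract can be illustrated with a minimal Python sketch. It is not one of the paper's predictors: the 16-processor system size, the `home` node parameter, the modeling of the directory's first hop as the home node alone, and the "last observed owner" policy in `LastOwnerPredictor` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet

NUM_PROCESSORS = 16  # hypothetical system size, not taken from the paper


def snooping_destinations() -> FrozenSet[int]:
    """Maximal destination set: the request goes to all processors."""
    return frozenset(range(NUM_PROCESSORS))


def directory_destinations(home: int) -> FrozenSet[int]:
    """Minimal first hop, modeled here (an assumption) as the home node only."""
    return frozenset({home})


class LastOwnerPredictor:
    """Hypothetical predictor: guess that the last observed owner still holds the block."""

    def __init__(self) -> None:
        # Maps a block address to the processor last seen owning it.
        self._last_owner: Dict[int, int] = {}

    def predict(self, block: int, home: int) -> FrozenSet[int]:
        """Return the home node plus the last known owner, if any."""
        destinations = {home}
        owner = self._last_owner.get(block)
        if owner is not None:
            destinations.add(owner)
        return frozenset(destinations)

    def train(self, block: int, owner: int) -> None:
        """Record which processor was observed to own the block."""
        self._last_owner[block] = owner


predictor = LastOwnerPredictor()
cold = predictor.predict(block=0x40, home=3)   # nothing learned yet
predictor.train(block=0x40, owner=7)
warm = predictor.predict(block=0x40, home=3)   # now includes processor 7
```

In this sketch a predicted set falls between the two extremes: it is never smaller than the directory's first hop and never larger than the snooping broadcast. A real predictor would also need a bounded table and a fallback path for mispredictions, which this sketch leaves out.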