Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Authors:
Jason F. Cantin;Mikko H. Lipasti;James E. Smith
Affiliations:
University of Wisconsin - Madison;University of Wisconsin - Madison;University of Wisconsin - Madison
Venue:
Proceedings of the 32nd annual international symposium on Computer Architecture
Year:
2005

Citing 18
Cited 37

A class of compatible cache consistency protocols and their support by the IEEE futurebus

ISCA '86 Proceedings of the 13th annual international symposium on Computer architecture
Adjustable block size coherent caches

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
The PowerPC architecture: a specification for a new family of RISC processors

The PowerPC architecture: a specification for a new family of RISC processors
Decoupled sectored caches: conciliating low tag implementation cost

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Isotach Networks

IEEE Transactions on Parallel and Distributed Systems
The pool of subsectors cache design

ICS '99 Proceedings of the 13th international conference on Supercomputing
Timestamp snooping: an approach for extending SMPs

ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors

Proceedings of the 2002 international symposium on Low power electronics and design
The sun fireplane system interconnect

Proceedings of the 2001 ACM/IEEE conference on Supercomputing
Simulating a $2M Commercial Server on a $2K PC

Computer
A dynamic cache sub-block design to reduce false sharing

ICCD '95 Proceedings of the 1995 International Conference on Computer Design: VLSI in Computers and Processors
Experimental evaluation of on-chip microprocessor cache memories

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
Token coherence: decoupling performance and correctness

Proceedings of the 30th annual international symposium on Computer architecture
Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Proceedings of the 30th annual international symposium on Computer architecture
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
IBM Power5 Chip: A Dual-Core Multithreaded Processor

IEEE Micro

RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Proceedings of the 32nd annual international symposium on Computer Architecture
Store Memory-Level Parallelism Optimizations for Commercial Applications

Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Coarse-Grain Coherence Tracking: RegionScout and Region Coherence Arrays

IEEE Micro
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors

Proceedings of the 33rd annual international symposium on Computer Architecture
Victim management in a cache hierarchy

IBM Journal of Research and Development - Advanced silicon technology
Stealth prefetching

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Exploiting access semantics and program behavior to reduce snoop power in chip multiprocessors

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Latency and bandwidth efficient communication through system customization for embedded multiprocessors

Proceedings of the 45th annual Design Automation Conference
Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors

Proceedings of the 13th international symposium on Low power electronics and design
Circuit-Switched Coherence

NOCS '08 Proceedings of the Second ACM/IEEE International Symposium on Networks-on-Chip
To Snoop or Not to Snoop: Evaluation of Fine-Grain and Coarse-Grain Snoop Filtering Techniques

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence

Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Zero-content augmented caches

Proceedings of the 23rd international conference on Supercomputing
In-network coherence filtering: snoopy coherence without broadcasts

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
A tagless coherence directory

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Low-power snoop architecture for synchronized producer-consumer embedded multiprocessing

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Cohesion: a hybrid memory model for accelerators

Proceedings of the 37th annual international symposium on Computer architecture
TurboTag: lookup filtering to reduce coherence directory power

Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design
WAYPOINT: scaling coherence to thousand-core architectures

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Subspace snooping: filtering snoops with operating system support

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Energy- and Performance-Efficient Communication Framework for Embedded MPSoCs through Application-Driven Release Consistency

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Virtual Snooping: Filtering Snoops in Virtualized Multi-cores

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A workload-adaptive and reconfigurable bus architecture for multicore processors

International Journal of Reconfigurable Computing
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks

Proceedings of the 38th annual international symposium on Computer architecture
Filtering directory lookups in CMPs with write-through caches

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Filtering directory lookups in CMPs

Microprocessors & Microsystems
Switch-based packing technique to reduce traffic and latency in token coherence

Journal of Parallel and Distributed Computing
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Using partial tag comparison in low-power snoop-based chip multiprocessors

ISCA'10 Proceedings of the 2010 international conference on Computer Architecture
Spatiotemporal Coherence Tracking

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache

Proceedings of the 40th Annual International Symposium on Computer Architecture
Building expressive, area-efficient coherence directories

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Multi-grain coherence directories

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Heterogeneous system coherence for integrated CPU-GPU systems

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
DP&TB: a coherence filtering protocol for many-core chip multiprocessors

The Journal of Supercomputing

Quantified Score

Hi-index	0.00

Abstract

To maintain coherence in conventional shared-memory multiprocessor systems, processors first check other processors' caches before obtaining data from memory. This coherence checking adds latency to memory requests and leads to large amounts of interconnect traffic in broadcast-based systems. Our results for a set of commercial, scientific, and multiprogrammed workloads show that on average 67% (and up to 94%) of broadcasts are unnecessary. Coarse-Grain Coherence Tracking is a new technique that supplements a conventional coherence mechanism and optimizes the performance of coherence enforcement. The Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory and uses that information to avoid unnecessary broadcasts. Coarse-Grain Coherence Tracking is shown to eliminate 55-97% of the unnecessary broadcasts and to improve performance by 8.8% on average (and up to 21.7%).
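
The idea of tracking coherence status per region to skip broadcasts can be illustrated with a toy model. The sketch below is not the paper's protocol: the region size, the `Node` class, the `access` function, and the two-set bookkeeping are all assumptions made for illustration, and evictions and line-level states are ignored. Each call to `access` models one cache miss; the miss goes directly to memory when the requester already knows that no other node caches lines from the region, and is broadcast otherwise.

```python
REGION_BYTES = 4096  # assumed region size, chosen only for illustration


class Node:
    """Hypothetical processor node with per-region bookkeeping."""

    def __init__(self, name):
        self.name = name
        self.cached_regions = set()   # regions this node holds lines from
        self.private_regions = set()  # regions known to be uncached elsewhere


def region_of(addr):
    """Map a byte address to its region number."""
    return addr // REGION_BYTES


def access(node, others, addr, stats):
    """Model one cache miss by `node`; count broadcasts vs. direct requests."""
    region = region_of(addr)
    if region in node.private_regions:
        # No other node caches this region, so the broadcast is unnecessary.
        stats["direct"] += 1
    else:
        stats["broadcast"] += 1
        shared = False
        for other in others:
            if region in other.cached_regions:
                shared = True
            # The requester will now cache a line from this region, so the
            # other nodes can no longer treat the region as private.
            other.private_regions.discard(region)
        if not shared:
            node.private_regions.add(region)
    node.cached_regions.add(region)


# Usage: two nodes touching the same region.
a, b = Node("A"), Node("B")
stats = {"broadcast": 0, "direct": 0}
access(a, [b], 0, stats)    # first touch: broadcast, region becomes private to A
access(a, [b], 64, stats)   # same region, still private: no broadcast needed
access(b, [a], 128, stats)  # B broadcasts; A loses its private status
access(a, [b], 192, stats)  # region is now shared: A must broadcast again
```

In this toy run only the second request avoids a broadcast, because node B later touches the same region. Workloads in which most regions are touched by a single processor would avoid far more.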