RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence

Authors:
Andreas Moshovos
Affiliations:
University of Toronto
Venue:
Proceedings of the 32nd annual international symposium on Computer Architecture
Year:
2005

Abstract

It has been shown that many requests miss in all remote nodes in shared memory multiprocessors. We are motivated by the observation that this behavior extends to much coarser grain areas of memory. We define a region to be a contiguous, aligned memory area whose size is a power of two, and observe that many requests find that no other node caches a block in the same region, even for regions as large as 16K bytes. We propose RegionScout, a family of simple filter mechanisms that dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes. RegionScout filters are implemented as a layered extension over existing snoop-based coherence systems. They require no changes to existing coherence protocols or caches and impose no constraints on what can be cached simultaneously. Their operation is completely transparent to software and the operating system. RegionScout filters require little additional storage and a single additional global signal. These characteristics are made possible by utilizing imprecise information about the regions cached in each node. Since they rely on dynamically collected information, RegionScout filters can adapt to changing sharing patterns. We present two applications of RegionScout. In the first, RegionScout is used to avoid broadcasts for non-shared regions, thus reducing bandwidth. In the second, RegionScout is used to avoid snoop-induced tag lookups, thus reducing energy.
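
The abstract does not spell out the internal structure of the filter, so the following is only a minimal sketch of the general idea under stated assumptions, not the paper's design. It assumes each node keeps (a) a small hashed array of counters recording, imprecisely, which regions it may be caching, and (b) a set of regions it has learned are not shared. All names (`Node`, `request`, `BLOCK_BYTES`, `hash_entries`) are hypothetical and chosen for illustration; the 16 KB region size is the one mentioned in the abstract.

```python
REGION_BYTES = 16 * 1024                      # region size, a power of two
REGION_SHIFT = REGION_BYTES.bit_length() - 1  # log2(REGION_BYTES)
BLOCK_BYTES = 64                              # assumed cache block size


def region_of(addr: int) -> int:
    """Region number of a byte address (regions are aligned, power-of-two)."""
    return addr >> REGION_SHIFT


class Node:
    """One node with a hypothetical RegionScout-style filter."""

    def __init__(self, hash_entries: int = 256):
        self.hash_entries = hash_entries
        # Imprecise record of cached regions: different regions may alias to
        # the same counter, so a non-zero count means "maybe cached here".
        self.region_counts = [0] * hash_entries
        self.not_shared = set()      # regions known to be cached nowhere else
        self.cached_blocks = set()   # stand-in for the real cache contents

    def _slot(self, region: int) -> int:
        return region % self.hash_entries

    def fill(self, addr: int) -> None:
        block = addr // BLOCK_BYTES
        if block not in self.cached_blocks:
            self.cached_blocks.add(block)
            self.region_counts[self._slot(region_of(addr))] += 1

    def evict(self, addr: int) -> None:
        block = addr // BLOCK_BYTES
        if block in self.cached_blocks:
            self.cached_blocks.remove(block)
            self.region_counts[self._slot(region_of(addr))] -= 1

    def may_cache_region(self, region: int) -> bool:
        # Never a false "no": safe to use for skipping snoops.
        return self.region_counts[self._slot(region)] > 0

    def observe_remote_request(self, region: int) -> None:
        # Another node is touching this region, so it may become shared.
        self.not_shared.discard(region)


def request(nodes, requester: int, addr: int) -> bool:
    """Handle a miss at `requester`. Returns True if a broadcast was sent."""
    node = nodes[requester]
    region = region_of(addr)
    if region in node.not_shared:
        node.fill(addr)              # known non-shared: skip the broadcast
        return False
    region_cached_elsewhere = False
    for i, other in enumerate(nodes):
        if i == requester:
            continue
        other.observe_remote_request(region)
        if other.may_cache_region(region):
            region_cached_elsewhere = True   # models the extra global signal
    if not region_cached_elsewhere:
        node.not_shared.add(region)
    node.fill(addr)
    return True


nodes = [Node(), Node()]
first = request(nodes, 0, 0x0000)    # broadcast; region 0 found non-shared
second = request(nodes, 0, 0x0040)   # same region: broadcast avoided
third = request(nodes, 1, 0x0080)    # node 1 touches region 0: must broadcast
```

In this sketch the counters can only over-report what a node caches, which is why imprecision costs some missed filtering opportunities but never correctness, and a region is dropped from `not_shared` as soon as any other node requests a block in it.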