Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

Authors:
Alvin R. Lebeck; David A. Wood
Affiliations:
Computer Sciences Department, University of Wisconsin-Madison, Madison, Wisconsin; Computer Sciences Department, University of Wisconsin-Madison, Madison, Wisconsin
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995

Abstract

This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path.

DSI is applicable to software, hardware, and hybrid coherence schemes. In this paper we evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks, which eliminate both invalidation and acknowledgment messages, for a total reduction in messages of up to 26%.
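
The core idea can be illustrated with a toy model. The sketch below is not the protocol evaluated in the paper. It assumes a single cache block tracked by a full-map directory, and it counts only the explicit invalidations the directory must send when a writer requests the block. The names `Directory`, `self_invalidate`, and `run` are hypothetical, the readers are assumed to know exactly when to drop their copies, and any message a real protocol would need to tell the directory about a dropped copy is ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy full-map directory for one cache block (illustrative only)."""
    sharers: set = field(default_factory=set)
    invalidations_sent: int = 0

    def read(self, cpu: int) -> None:
        # A read miss adds the requester to the sharer set.
        self.sharers.add(cpu)

    def self_invalidate(self, cpu: int) -> None:
        # The cache drops its own copy; in this toy model the directory
        # learns about it for free (an assumption, not the paper's design).
        self.sharers.discard(cpu)

    def write(self, cpu: int) -> None:
        # A write must invalidate every other outstanding copy first.
        victims = self.sharers - {cpu}
        self.invalidations_sent += len(victims)
        self.sharers = {cpu}


def run(use_self_invalidation: bool, phases: int = 3) -> int:
    """Processors 1-3 read the block, then processor 0 writes it, per phase."""
    directory = Directory()
    readers = [1, 2, 3]
    for _ in range(phases):
        for cpu in readers:
            directory.read(cpu)
        if use_self_invalidation:
            # Readers give up their copies before the conflicting write.
            for cpu in readers:
                directory.self_invalidate(cpu)
        directory.write(0)
    return directory.invalidations_sent


if __name__ == "__main__":
    print("without self-invalidation:", run(False))
    print("with self-invalidation:   ", run(True))
```

In this model the writer waits on three invalidations per phase without self-invalidation and on none with it, which is the latency the abstract describes as falling on the critical path under sequential consistency. Deciding which blocks to self-invalidate, and when, is the hard part and is not modeled here.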