Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency

Authors:
Haiming Liu;Michael Ferdman;Jaehyuk Huh;Doug Burger
Affiliations:
Dept. of Computer Sciences, The University of Texas at Austin, USA;Dept. of Elec.&Comp. Eng., Carnegie Mellon University, USA;Advanced Micro Devices, USA;Microsoft Research, One Microsoft Way Redmond, WA, USA
Venue:
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Year:
2008

Citing 22
Cited 32

The SPLASH-2 programs: characterization and methodological considerations

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
A data cache with multiple caching strategies tuned to different types of locality

ICS '95 Proceedings of the 9th international conference on Supercomputing
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Utilizing reuse information in data cache management

ICS '98 Proceedings of the 12th international conference on Supercomputing
Run-Time Cache Bypassing

IEEE Transactions on Computers
Selective, accurate, and timely self-invalidation using last-touch prediction

Proceedings of the 27th annual international symposium on Computer architecture
Dead-block prediction & dead-block correlating prefetchers

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Cache decay: exploiting generational behavior to reduce cache leakage power

ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Drowsy caches: simple techniques for reducing leakage power

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Timekeeping in the memory system: predicting and optimizing memory behavior

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Modeling Live and Dead Lines in Cache Memory Systems

IEEE Transactions on Computers
Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications

Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques
Using the Compiler to Improve Cache Replacement Decisions

Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
TCP: Tag Correlating Prefetchers

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Self-correcting LRU replacement policies

Proceedings of the 1st conference on Computing frontiers
Memory coherence activity prediction in commercial workloads

WMPI '04 Proceedings of the 3rd workshop on Memory performance issues: in conjunction with the 31st international symposium on computer architecture
Cooperative Caching with Keep-Me and Evict-Me

INTERACT '05 Proceedings of the 9th Annual Workshop on Interaction between Compilers and Computer Architectures
IATAC: a smart predictor to turn-off L2 cache lines

ACM Transactions on Architecture and Code Optimization (TACO)
Hardware techniques to reduce communication costs in multiprocessors

Hardware techniques to reduce communication costs in multiprocessors
Adaptive insertion policies for high performance caching

Proceedings of the 34th annual international symposium on Computer architecture
Counter-Based Cache Replacement and Bypassing Algorithms

IEEE Transactions on Computers

Less reused filter: improving l2 cache performance via filtering less reused lines

Proceedings of the 23rd international conference on Supercomputing
PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches

Proceedings of the 36th annual international symposium on Computer architecture
Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Pseudo-LIFO: the foundation of a new family of replacement policies for last-level caches

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
High performance cache replacement using re-reference interval prediction (RRIP)

Proceedings of the 37th annual international symposium on Computer architecture
Using dead blocks as a virtual victim cache

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Sampling Dead Block Prediction for Last-Level Caches

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
STEM: Spatiotemporal Management of Capacity for Intra-core Last Level Caches

MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
Management policies analysis for multi-core shared caches

ADMA'10 Proceedings of the 6th international conference on Advanced data mining and applications - Volume Part II
Bypass and insertion algorithms for exclusive last-level caches

Proceedings of the 38th annual international symposium on Computer architecture
Evaluating placement policies for managing capacity sharing in CMP architectures with private caches

ACM Transactions on Architecture and Code Optimization (TACO)
SHiP: signature-based hit predictor for high performance caching

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Unified memory optimizing architecture: memory subsystem control with a unified predictor

Proceedings of the 26th ACM international conference on Supercomputing
Improving writeback efficiency with decoupled last-write prediction

Proceedings of the 39th Annual International Symposium on Computer Architecture
Combining recency of information with selective random and a victim cache in last-level caches

ACM Transactions on Architecture and Code Optimization (TACO)
Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Optimal bypass monitor for high performance last-level caches

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
The evicted-address filter: a unified mechanism to address both cache pollution and thrashing

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Exploiting reuse locality on inclusive shared last-level caches

ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Improving Cache Management Policies Using Dynamic Reuse Distances

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
The locality-aware adaptive cache coherence protocol

Proceedings of the 40th Annual International Symposium on Computer Architecture
Managing shared last-level cache in a heterogeneous multicore processor

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Low-energy volatile STT-RAM cache design using cache-coherence-enabled adaptive refresh

ACM Transactions on Design Automation of Electronic Systems (TODAES)
A Buffered Dual-Access-Mode Scheme Designed for Low-Power Highly-Associative Caches

International Journal of Embedded and Real-Time Communication Systems
Insertion and promotion for tree-based PseudoLRU last-level caches

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
The reuse cache: downsizing the shared last-level cache

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Virtually split cache: An efficient mechanism to distribute instructions and data

ACM Transactions on Architecture and Code Optimization (TACO)
An efficient compiler framework for cache bypassing on GPUs

Proceedings of the International Conference on Computer-Aided Design
An effectiveness-based adaptive cache replacement policy

Microprocessors & Microsystems

Quantified Score

Hi-index	0.01

Visualization

Abstract

Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, we find that predicting the death of a block when it just moves out of the MRU position gives the best tradeoff between timeliness and prediction accuracy/coverage. Furthermore, the individual reference history of a block in the L1 cache can be irregular because of data/control dependence. This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block. A cache burst begins when a block becomes MRU and ends when it becomes non-MRU. Cache bursts are more predictable than individual references because they hide the irregularity of individual references. When used at the L1 cache, the best burst-based predictor can identify 96% of the dead blocks with a 96% accuracy. With the improved dead-block predictors, we evaluate three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching. The most effective approach, prefetching into dead blocks, increases the average L1 efficiency from 8% to 17% and the L2 efficiency from 17% to 27%. This increased cache efficiency translates into higher overall performance: prefetching into dead blocks outperforms the same prefetch scheme without dead-block prediction by 12% at the L1 and by 13% at the L2.