Energy-efficient MESI cache coherence with pro-active snoop filtering for multicore microprocessors
Proceedings of the 13th international symposium on Low power electronics and design
A framework for low energy data management in reconfigurable multi-context architectures
Journal of Systems Architecture: the EUROMICRO Journal
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 23rd international conference on Supercomputing
In-network coherence filtering: snoopy coherence without broadcasts
Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks
Proceedings of the 38th annual international symposium on Computer architecture
Reducing last level cache pollution through OS-level software-controlled region-based partitioning
Proceedings of the 27th Annual ACM Symposium on Applied Computing
A dual grain hit-miss detector for large die-stacked DRAM caches
Proceedings of the Conference on Design, Automation and Test in Europe
Building expressive, area-efficient coherence directories
PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
TLC: a tag-less cache for reducing dynamic first level cache energy
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Decoupled compressed cache: exploiting spatial locality for energy-optimized compressed caching
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Heterogeneous system coherence for integrated CPU-GPU systems
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
DP&TB: a coherence filtering protocol for many-core chip multiprocessors
The Journal of Supercomputing
Hi-index | 0.02 |
Current on-chip block-centric memory hierarchies exploit access patterns at the fine-grain scale of small blocks. Several recently proposed techniques for coherence traffic reduction and prefetching suggest that further useful patterns emerge with a macroscopic, coarse-grain view. To exploit coarse- grain behavior, previous work extended conventional caches with additional coarse-grain tracking and management structures considerably increasing overall cost and complexity. This paper demonstrates that as multi-megabyte caches have become commonplace, coarse-grain tracking and management no longer needs to be an afterthought. This functionality comes "for free" via RegionTracker. RegionTracker is a dual-grain cache design that maintains block-level communication while directly supporting coarse-grain tracking and management. Compared to a block-centric conventional cache of the same data capacity, RegionTracker requires less area to achieve a nearly identical miss rate (within 1%). RegionTracker can be used as the building block for coarse-grain optimizations, reducing their overall cost and easing their adoption. Using full-system simulation of a quad- core chip multiprocessor, commercial workloads, and area estimates based on full-custom layouts on a 130nm commercial technology, we demonstrate the performance and cost viability of the RegionTracker design. We also demonstrate the potential of RegionTracker as a framework for coarse-grain optimizations by showing that it boosts the benefits and reduces the cost of a previously proposed snoop reduction technique.