Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design

Authors:
Moinuddin K. Qureshi;Gabe H. Loh
Affiliations:
-;-
Venue:
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2012

Citing 12
Cited 7

A Case for Direct-Mapped Caches

Computer
A modified approach to data cache management

Proceedings of the 28th annual international symposium on Microarchitecture
Using SimPoint for accurate and efficient simulation

SIGMETRICS '03 Proceedings of the 2003 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Cache miss behavior: is it √2?

Proceedings of the 3rd conference on Computing frontiers
Adaptive insertion policies for high performance caching

Proceedings of the 34th annual international symposium on Computer architecture
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Using dead blocks as a virtual victim cache

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
SHiP: signature-based hit predictor for high performance caching

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Efficiently enabling conventional block sizes for very large die-stacked DRAM caches

Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap

IEEE Micro
Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management

IEEE Computer Architecture Letters

A dual grain hit-miss detector for large die-stacked DRAM caches

Proceedings of the Conference on Design, Automation and Test in Europe
Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache

Proceedings of the 40th Annual International Symposium on Computer Architecture
Resilient die-stacked DRAM caches

Proceedings of the 40th Annual International Symposium on Computer Architecture
Software-controlled transparent management of heterogeneous memory resources in virtualized systems

Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness
Large-reach memory management unit caches

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
ARI: Adaptive LLC-memory traffic management

ACM Transactions on Architecture and Code Optimization (TACO)
Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper analyzes the design trade-offs in architecting large-scale DRAM caches. Prior research, including the recent work from Loh and Hill, have organized DRAM caches similar to conventional caches. In this paper, we contend that some of the basic design decisions typically made for conventional caches (such as serialization of tag and data access, large associativity, and update of replacement state) are detrimental to the performance of DRAM caches, as they exacerbate the already high hit latency. We show that higher performance can be obtained by optimizing the DRAM cache architecture first for latency, and then for hit rate. We propose a latency-optimized cache architecture, called Alloy Cache, that eliminates the delay due to tag serialization by streaming tag and data together in a single burst. We also propose a simple and highly effective Memory Access Predictor that incurs a storage overhead of 96 bytes per core and a latency of 1 cycle. It helps service cache misses faster without the need to wait for a cache miss detection in the common case. Our evaluations show that our latency-optimized cache design significantly outperforms both the recent proposal from Loh and Hill, as well as an impractical SRAM Tag-Store design that incurs an unacceptable overhead of several tens of megabytes. On average, the proposal from Loh and Hill provides 8.7% performance improvement, the "idealized" SRAM Tag design provides 24%, and our simple latency-optimized design provides 35%.