Efficiently enabling conventional block sizes for very large die-stacked DRAM caches

Authors:
Gabriel H. Loh;Mark D. Hill
Affiliations:
AMD Research, Advanced Micro Devices, Inc.;University of Wisconsin, Madison, and AMD Research, Advanced Micro Devices, Inc.
Venue:
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2011

Citing 27
Cited 17

Decoupled sectored caches: conciliating low tag implementation cost

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The pool of subsectors cache design

ICS '99 Proceedings of the 13th international conference on Supercomputing
Memory access scheduling

Proceedings of the 27th annual international symposium on Computer architecture
Interconnect characteristics of 2.5-D system integration scheme

Proceedings of the 2001 international symposium on Physical design
False Sharing and Spatial Locality in Multiprocessor Caches

IEEE Transactions on Computers
Experimental evaluation of on-chip microprocessor cache memories

ISCA '84 Proceedings of the 11th annual international symposium on Computer architecture
JETTY: Filtering Snoops for Reduced Energy Consumption in SMP Servers

HPCA '01 Proceedings of the 7th International Symposium on High-Performance Computer Architecture
Filtering Superfluous Prefetches Using Density Vectors

ICCD '01 Proceedings of the International Conference on Computer Design: VLSI in Computers & Processors
Design and Optimization of Large Size and Low Overhead Off-Chip Caches

IEEE Transactions on Computers
3D Processing Technology and Its Impact on iA32 Microprocessors

ICCD '04 Proceedings of the IEEE International Conference on Computer Design
Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Proceedings of the 32nd annual international symposium on Computer Architecture
Three-Dimensional Cache Design Exploration Using 3DCacti

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Implementing Caches in a 3D Technology for High Performance Processors

ICCD '05 Proceedings of the 2005 International Conference on Computer Design
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset

ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Bridging the Processor-Memory Performance Gapwith 3D IC Technology

IEEE Design & Test
Design space exploration for 3D architectures

ACM Journal on Emerging Technologies in Computing Systems (JETC)
The M5 Simulator: Modeling Networked Systems

IEEE Micro
PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Die Stacking (3D) Microarchitecture

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs

Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
3D-Stacked Memory Architectures for Multi-core Processors

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Exploring Phase Change Memory and 3D Die-Stacking for Power/Thermal Friendly, Fast and Durable Memory Architectures

PACT '09 Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques
Structural aspects of the system/360 model 85: II the cache

IBM Systems Journal
Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor

IEEE Micro
System-level power/performance evaluation of 3D stacked DRAMs for mobile applications

Proceedings of the Conference on Design, Automation and Test in Europe
Simple but Effective Heterogeneous Main Memory with On-Chip Memory Controller Support

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Exploring latency-power tradeoffs in deep nonvolatile memory hierarchies

Proceedings of the 9th conference on Computing Frontiers
LOT-ECC: localized and tiered reliability mechanisms for commodity memory systems

Proceedings of the 39th Annual International Symposium on Computer Architecture
A multi-core memory organization for 3-d DRAM as main memory

ARCS'13 Proceedings of the 26th international conference on Architecture of Computing Systems
Leveraging Heterogeneity in DRAM Main Memories to Accelerate Critical Word Access

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Amoeba-Cache: Adaptive Blocks for Eliminating Waste in the Memory Hierarchy

MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Reuse-based online models for caches

Proceedings of the ACM SIGMETRICS/international conference on Measurement and modeling of computer systems
Adaptive cache management for a combined SRAM and DRAM cache hierarchy for multi-cores

Proceedings of the Conference on Design, Automation and Test in Europe
A dual grain hit-miss detector for large die-stacked DRAM caches

Proceedings of the Conference on Design, Automation and Test in Europe
Reducing memory access latency with asymmetric DRAM bank organizations

Proceedings of the 40th Annual International Symposium on Computer Architecture
Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache

Proceedings of the 40th Annual International Symposium on Computer Architecture
Resilient die-stacked DRAM caches

Proceedings of the 40th Annual International Symposium on Computer Architecture
The case for a scalable coherence protocol for complex on-chip cache hierarchies in many core systems

PACT '13 Proceedings of the 22nd international conference on Parallel architectures and compilation techniques
Large-reach memory management unit caches

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Reducing inter-core cache contention with an adaptive bank mapping policy in DRAM cache

Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis
Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies

Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Die-stacking technology enables multiple layers of DRAM to be integrated with multicore processors. A promising use of stacked DRAM is as a cache, since its capacity is insufficient to be all of main memory (for all but some embedded systems). However, a 1GB DRAM cache with 64-byte blocks requires 96MB of tag storage. Placing these tags on-chip is impractical (larger than on-chip L3s) while putting them in DRAM is slow (two full DRAM accesses for tag and data). Larger blocks and sub-blocking are possible, but less robust due to fragmentation. This work efficiently enables conventional block sizes for very large die-stacked DRAM caches with two innovations. First, we make hits faster than just storing tags in stacked DRAM by scheduling the tag and data accesses as a compound access so the data access is always a row buffer hit. Second, we make misses faster with a MissMap that eschews stacked-DRAM access on all misses. Like extreme sub-blocking, our implementation of the MissMap stores a vector of block-valid bits for each "page" in the DRAM cache. Unlike conventional sub-blocking, the MissMap (a) points to many more pages than can be stored in the DRAM cache (making the effects of fragmentation rare) and (b) does not point to the "way" that holds a block (but defers to the off-chip tags). For the evaluated large-footprint commercial workloads, the proposed cache organization delivers 92.9% of the performance benefit of an ideal 1GB DRAM cache with an impractical 96MB on-chip SRAM tag array.