Understanding and mitigating refresh overheads in high-density DDR4 DRAM systems

  • Authors:
  • Janani Mukundan;Hillery Hunter;Kyu-hyoun Kim;Jeffrey Stuecheli;José F. Martínez

  • Affiliations:
  • Cornell University, Ithaca, NY;IBM Thomas J. Watson, Yorktown Heights, NY;IBM Thomas J. Watson, Yorktown Heights, NY;IBM Systems and Tech. Group, Austin, TX;Cornell University, Ithaca, NY

  • Venue:
  • Proceedings of the 40th Annual International Symposium on Computer Architecture
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Recent DRAM specifications exhibit increasing refresh latencies. A refresh command blocks a full rank, decreasing available parallelism in the memory subsystem significantly, thus decreasing performance. Fine Granularity Refresh (FGR) is a feature recently announced as part of JEDEC's DDR4 DRAM specification that attempts to tackle this problem by creating a range of refresh options that provide a trade-off between refresh latency and frequency. In this paper, we first conduct an analysis of DDR4 DRAM's FGR feature, and show that there is no one-size-fits-all option across a variety of applications. We then present Adaptive Refresh (AR), a simple yet effective mechanism that dynamically chooses the best FGR mode for each application and phase within the application. When looking at the refresh problem more closely, we identify in high-density DRAM systems a phenomenon that we call command queue seizure, whereby the memory controller's command queue seizes up temporarily because it is full with commands to a rank that is being refreshed. To attack this problem, we propose two complementary mechanisms called Delayed Command Expansion (DCE) and Preemptive Command Drain (PCD). Our results show that AR does exploit DDR4's FGR effectively. However, once our proposed DCE and PCD mechanisms are added, DDR4's FGR becomes redundant in most cases, except in a few highly memory-sensitive applications, where the use of AR does provide some additional benefit. In all, our simulations show that the proposed mechanisms yield 8% (14%) mean speedup with respect to traditional refresh, at normal (extended) DRAM operating temperatures, for a set of diverse parallel applications.