Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory

  • Authors:
  • Jie Fan; Song Jiang; Jiwu Shu; Youhui Zhang; Weimin Zheng

  • Affiliations:
  • Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China; Wayne State University, Detroit, MI; Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China; Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China; Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China

  • Venue:
  • Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
  • Year:
  • 2013

Abstract

While Phase Change Memory (PCM) holds great promise as a complement to, or even a replacement for, DRAM-based memory and flash-based storage, it must overcome its limited write endurance to be a reliable device over an extended period of intensive use. Limited write endurance can lead to permanent stuck-at faults after a certain number of writes, leaving some memory cells permanently stuck at either '0' or '1'. State-of-the-art solutions partition a data block into bit groups and then apply a bit inversion technique to selected groups. The effectiveness of this approach hinges on how the data block is partitioned. While all existing solutions can separate faults into different groups for error correction, they fall short on three fundamental capabilities desired of any partition scheme. First, it should maximize the probability of successfully re-partitioning a block so that two faults currently in the same group are placed into two different groups. Second, it should partition a block into a small number of groups for space efficiency. Third, it should spread faults across the groups as uniformly as possible, so that more faults can be accommodated within the same number of groups. A recovery solution with these capabilities can provide strong fault tolerance with minimal overhead. We propose Aegis, a recovery solution with a systematic partition scheme that uses fewer groups to accommodate more faults than state-of-the-art schemes. The uniqueness of Aegis's partition scheme lies in its guarantee that any two bits in the same group will not be in the same group after a re-partition. Empowered by this partition scheme, Aegis can recover significantly more faults with reduced space overhead relative to state-of-the-art solutions.
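
The abstract does not spell out the partition construction, so the following is only a minimal sketch of one way to obtain the stated guarantee, not a rendering of the paper's design. It assumes a toy 20-bit block laid out as A columns by B rows with B prime and A <= B, and groups bits along modular "lines" of a chosen slope. The names `A`, `B`, `group_of`, `find_slope`, `encode`, and `decode` are all hypothetical. Under these assumptions, two distinct bits share a group for at most one slope, so switching slopes always separates a colliding pair, and per-group inversion flags then mask one stuck-at fault per group.

```python
from itertools import combinations

# Assumed toy layout (not from the abstract): A columns, B rows, B prime, A <= B.
A, B = 4, 5
N_BITS = A * B

def group_of(bit, slope):
    """Group index of a bit position under the partition with the given slope."""
    a, b = divmod(bit, B)            # a in [0, A), b in [0, B)
    return (b - a * slope) % B

def pairs_never_regroup():
    """Check that no pair of bits shares a group under more than one slope."""
    for p, q in combinations(range(N_BITS), 2):
        shared = [s for s in range(B) if group_of(p, s) == group_of(q, s)]
        if len(shared) > 1:
            return False
    return True

def find_slope(fault_positions):
    """Return the first slope that puts every fault in its own group, else None."""
    for slope in range(B):
        groups = [group_of(f, slope) for f in fault_positions]
        if len(set(groups)) == len(groups):
            return slope
    return None

def encode(data, faults):
    """Store data despite stuck cells; faults maps position -> stuck value."""
    slope = find_slope(list(faults))
    if slope is None:
        raise ValueError("faults cannot be separated in this toy layout")
    flags = [0] * B                  # one inversion flag per group
    for pos, stuck in faults.items():
        if data[pos] != stuck:       # invert the group so the stuck value fits
            flags[group_of(pos, slope)] = 1
    stored = [bit ^ flags[group_of(i, slope)] for i, bit in enumerate(data)]
    for pos, stuck in faults.items():
        stored[pos] = stuck          # a stuck cell holds its value regardless
    return stored, slope, flags

def decode(stored, slope, flags):
    """Undo the per-group inversion to recover the original data."""
    return [bit ^ flags[group_of(i, slope)] for i, bit in enumerate(stored)]
```

In this sketch, bits 0 and 5 fall into the same group under slope 0, so `find_slope([0, 5])` moves on to slope 1, where they are separated. The metadata per block is the chosen slope plus one inversion flag per group, which is why a scheme that needs fewer groups also needs less space.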