Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory

Authors:
Jie Fan; Song Jiang; Jiwu Shu; Youhui Zhang; Weimin Zheng
Affiliations:
Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China;Wayne State University, Detroit, MI;Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China;Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China;Tsinghua University, Beijing, China and Tsinghua National Laboratory for Information Science and Technology, Beijing, China
Venue:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
Year:
2013

Abstract

While Phase Change Memory (PCM) holds great promise as a complement to, or even a replacement for, DRAM-based memory and flash-based storage, it must overcome its limited write endurance to remain reliable over an extended period of intensive use. Limited write endurance leads to permanent stuck-at faults after a certain number of writes, leaving some memory cells permanently stuck at either '0' or '1'. State-of-the-art solutions partition a data block into bit groups and apply bit inversion to selected groups. The effectiveness of this approach hinges on how the block is partitioned. While all existing solutions can separate faults into different groups for error correction, they fall short on three fundamental capabilities desired of any partition scheme. First, the scheme should maximize the probability that re-partitioning a block succeeds, that is, that two faults currently in the same group are placed into two different groups. Second, it should partition a block into a small number of groups for space efficiency. Third, it should spread faults across the groups as uniformly as possible, so that more faults can be accommodated within the same number of groups. A recovery solution with these capabilities can provide strong fault tolerance with minimal overhead. We propose Aegis, a recovery solution with a systematic partition scheme that uses fewer groups to accommodate more faults than state-of-the-art schemes. The uniqueness of Aegis's partition scheme lies in its guarantee that any two bits in the same group will not share a group after a re-partition. Empowered by this partition scheme, Aegis can recover significantly more faults with reduced space overhead relative to state-of-the-art solutions.
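
The abstract does not spell out how the partitions are built, so the following is only a minimal sketch of one construction with the stated property, not necessarily the authors' method. It assumes the block's bits are laid out on a ROWS x COLS grid with COLS prime and ROWS <= COLS, and that each candidate partition is selected by a "slope". The names `group_of`, `find_slope`, `encode`, and `decode`, the 20-bit block size, and the rule of at most one fault per group are all hypothetical choices made for illustration. The sketch also assumes the slope and the set of inverted groups are kept in fault-free metadata.

```python
ROWS, COLS = 4, 5  # hypothetical 20-bit block; COLS is prime, ROWS <= COLS

def group_of(bit, slope):
    """Group index of a bit under the partition selected by `slope`."""
    row, col = divmod(bit, COLS)
    # Bits on the same "line" of the grid (mod COLS) form one group.
    return (col + row * slope) % COLS

def find_slope(faults):
    """Return the first slope that puts every faulty bit in its own group."""
    for slope in range(COLS):
        groups = [group_of(f, slope) for f in faults]
        if len(set(groups)) == len(groups):
            return slope
    return None

def encode(data, stuck):
    """Pick a partition and invert groups so stored bits agree with stuck cells.

    `stuck` maps a faulty bit index to the value the cell is stuck at.
    Returns (slope, inverted_groups, stored_bits).
    """
    slope = find_slope(stuck)
    if slope is None:
        raise ValueError("no partition separates these faults")
    # A group is inverted when its single fault is stuck at the wrong value.
    inverted = {group_of(i, slope) for i, v in stuck.items() if data[i] != v}
    stored = [b ^ (group_of(i, slope) in inverted) for i, b in enumerate(data)]
    return slope, inverted, stored

def decode(stored, slope, inverted):
    """Undo the group inversions to recover the original data."""
    return [b ^ (group_of(i, slope) in inverted) for i, b in enumerate(stored)]
```

Under these assumptions, two bits in different rows share a group for exactly one slope, and two bits in the same row never share a group, so a pair of faults that collides under the current partition is separated by every other slope. For example, with faults at bits 0, 5, and 12, slope 0 places bits 0 and 5 in the same group, while slope 1 separates all three.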