A realistic evaluation of memory hardware errors and software system susceptibility

  • Authors: Xin Li; Michael C. Huang; Kai Shen; Lingkun Chu
  • Affiliations: University of Rochester; University of Rochester; University of Rochester; Ask.com
  • Venue: USENIX ATC '10: Proceedings of the 2010 USENIX Annual Technical Conference
  • Year: 2010

Abstract

Memory hardware reliability is an indispensable part of whole-system dependability. This paper presents realistic memory hardware error traces (covering both transient and non-transient errors) collected over about nine months from production computer systems with more than 800 GB of memory. Detailed information on the error addresses allows us to identify patterns of single-bit, row, column, and whole-chip memory errors. Based on the collected traces, we explore the implications of different hardware ECC protection schemes to identify the most common error causes and the approximate error rates exposed to the software level. Further, we investigate software system susceptibility to the major error causes, with the goal of validating, questioning, and augmenting the results of prior studies. In particular, we find that the earlier result that most memory hardware errors do not lead to incorrect software execution may not be valid, because it rests on an unrealistic model in which all errors are transient. Our study is based on an efficient memory error injection approach that applies hardware watchpoints to hotspot memory regions.
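
As a rough illustration of how error addresses could be grouped into the single-bit, row, column, and whole-chip patterns named in the abstract, here is a minimal Python sketch. It is not the authors' method: the `ErrorRecord` layout, the `classify_pattern` function, and the rule that anything spanning several rows and columns counts as "whole-chip" are all assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical record layout; the trace format used in the paper is not given here.
ErrorRecord = namedtuple("ErrorRecord", ["bank", "row", "column"])


def classify_pattern(records):
    """Classify the error records observed on one memory chip.

    Assumed rules (a simplification for illustration only):
      - one distinct faulty cell                -> "single-bit"
      - several cells, all in the same row      -> "row"
      - several cells, all in the same column   -> "column"
      - anything else                           -> "whole-chip"
    """
    # Repeated reports of the same cell count once.
    cells = {(r.bank, r.row, r.column) for r in records}
    if not cells:
        return "none"
    if len(cells) == 1:
        return "single-bit"

    rows = {(bank, row) for bank, row, _ in cells}
    columns = {(bank, column) for bank, _, column in cells}
    if len(rows) == 1:
        return "row"
    if len(columns) == 1:
        return "column"
    return "whole-chip"
```

For example, two records that share a bank and row but differ in column would be labeled "row" under these assumed rules, while repeated reports of one cell stay "single-bit".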