A rising tide lifts all boats: how memory error prediction and prevention can help with virtualized system longevity

  • Authors:
  • Yuyang Du;Hongliang Yu;Yunhong Jiang;Yaozu Dong;Weimin Zheng

  • Affiliations:
  • Department of Computer Science and Technology, Tsinghua University;Department of Computer Science and Technology, Tsinghua University;Intel Research and Development, Asia-Pacific;Intel Research and Development, Asia-Pacific;Department of Computer Science and Technology, Tsinghua University

  • Venue:
  • HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
  • Year:
  • 2010

Abstract

Memory is the most frequently failing component that can cause a system crash, which significantly affects emerging data centers built on system virtualization (e.g., clouds). Such an environment differs from previously studied large systems and thus poses a renewed challenge to the reliability, availability, and serviceability (RAS) of today's production sites, which host large populations of commodity servers. This paper advocates addressing the problem by exploiting memory error characteristics and employing a cost-effective self-healing mechanism. Specifically, we propose a memory error prediction and prevention model that takes error events and system utilization as input, assesses memory error risk, and manipulates memory mappings accordingly (by page/DIMM replacement or VM live migration) to avoid potential damage and loss.
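
The predict-then-prevent loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the risk formula, the thresholds, and every name (`DimmStatus`, `assess_risk`, `choose_action`) are hypothetical, chosen only to show how error events and utilization might feed a risk score that selects among the three remedies the abstract lists.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    REPLACE_PAGE = "replace_page"
    REPLACE_DIMM = "replace_dimm"
    MIGRATE_VM = "migrate_vm"


@dataclass
class DimmStatus:
    correctable_errors: int  # error events seen in the observation window
    distinct_pages: int      # number of distinct pages those errors touched
    utilization: float       # fraction of the DIMM in use, 0.0 to 1.0


def assess_risk(status: DimmStatus) -> float:
    """Hypothetical risk score: error count weighted by utilization."""
    # Assumption: the same error count matters more on a busier DIMM.
    return status.correctable_errors * (0.5 + status.utilization)


def choose_action(status: DimmStatus, spare_dimm_available: bool) -> Action:
    """Pick a remedy from the risk score (thresholds are illustrative)."""
    risk = assess_risk(status)
    if risk < 1.0:
        return Action.NONE
    # Errors confined to a few pages: remap just those pages.
    if status.distinct_pages <= 2 and risk < 10.0:
        return Action.REPLACE_PAGE
    # Widespread errors: swap the DIMM if a spare exists,
    # otherwise move the guest off the host entirely.
    if spare_dimm_available:
        return Action.REPLACE_DIMM
    return Action.MIGRATE_VM
```

In this sketch, a DIMM with three correctable errors on a single page at 50% utilization would be handled by page replacement, while twenty errors spread over eight pages would trigger DIMM replacement or, with no spare available, VM live migration.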