A rising tide lifts all boats: how memory error prediction and prevention can help with virtualized system longevity

Authors:
Yuyang Du;Hongliang Yu;Yunhong Jiang;Yaozu Dong;Weimin Zheng
Affiliations:
Department of Computer Science and Technology, Tsinghua University;Department of Computer Science and Technology, Tsinghua University;Intel Research and Development, Asia-Pacific;Intel Research and Development, Asia-Pacific;Department of Computer Science and Technology, Tsinghua University
Venue:
HotDep'10 Proceedings of the Sixth international conference on Hot topics in system dependability
Year:
2010

Abstract

Memory is the most frequently failing component that can cause a system crash, and such failures significantly affect emerging data centers built on system virtualization (e.g., clouds). These environments differ from previously studied large systems and thus pose renewed challenges to the reliability, availability, and serviceability (RAS) of today's production sites, which host large populations of commodity servers. This paper advocates addressing the problem by exploiting memory error characteristics and employing a cost-effective self-healing mechanism. Specifically, we propose a memory error prediction and prevention model that takes error events and system utilization as input, assesses memory error risk, and manipulates memory mappings accordingly (by page/DIMM replacement or VM live migration) to avoid potential damage and loss.
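The predict-then-prevent loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' model: the abstract gives no formula, so the risk score, the thresholds `low` and `high`, the page-count cutoff, and all names (`DimmStats`, `assess_risk`, `choose_action`) are assumptions made purely for illustration. The sketch only shows the shape of the idea: error events and utilization go in, a risk estimate comes out, and the estimate selects one of the remapping actions the abstract names.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    RETIRE_PAGE = "retire_page"    # remap away from a few bad pages
    REPLACE_DIMM = "replace_dimm"  # errors are spread across the module
    MIGRATE_VM = "migrate_vm"      # risk too high to keep the VM on this host


@dataclass
class DimmStats:
    corrected_errors: int  # error events seen in the observation window
    distinct_pages: int    # number of distinct pages those errors hit
    utilization: float     # fraction of the DIMM currently mapped, 0.0 to 1.0


def assess_risk(stats: DimmStats) -> float:
    """Hypothetical risk score (an assumption, not the paper's formula).

    More errors, errors spread over more pages, and heavier use of the
    DIMM all raise the score.
    """
    if stats.corrected_errors == 0:
        return 0.0
    spread = stats.distinct_pages / stats.corrected_errors
    return stats.corrected_errors * (1.0 + spread) * stats.utilization


def choose_action(stats: DimmStats, low: float = 1.0, high: float = 10.0) -> Action:
    """Map the risk score to a preventive action; thresholds are illustrative."""
    risk = assess_risk(stats)
    if risk < low:
        return Action.NONE
    if risk < high:
        # Localized errors: retiring pages is the cheapest fix.
        # Widely spread errors: the whole module is suspect.
        if stats.distinct_pages <= 2:
            return Action.RETIRE_PAGE
        return Action.REPLACE_DIMM
    return Action.MIGRATE_VM


if __name__ == "__main__":
    for s in (DimmStats(0, 0, 0.9), DimmStats(2, 1, 0.5),
              DimmStats(6, 6, 0.5), DimmStats(20, 10, 0.8)):
        print(s, "->", choose_action(s).value)
```

In a real deployment the inputs would come from hardware error reports and hypervisor accounting, and the actions would be carried out by the hypervisor; both are outside the scope of this sketch.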