ExtraVirt: detecting and recovering from transient processor faults

Authors:
Dominic Lucchetti;Steven K. Reinhardt;Peter M. Chen
Affiliations:
University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI;University of Michigan, Ann Arbor, MI
Venue:
Proceedings of the twentieth ACM symposium on Operating systems principles
Year:
2005

Citing 6
Cited 7

Hypervisor-based fault tolerance

SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
Xen and the art of virtualization

SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
The Soft Error Problem: An Architectural Perspective

HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
SWIFT: Software Implemented Fault Tolerance

Proceedings of the international symposium on Code generation and optimization
Memory resource management in VMware ESX server

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation
ReVirt: enabling intrusion analysis through virtual-machine logging and replay

OSDI '02 Proceedings of the 5th symposium on Operating systems design and implementation

A regulated transitive reduction (RTR) for longer memory race recording

Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Rerun: Exploiting Episodes for Lightweight Memory Race Recording

ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
Partitioning techniques for partially protected caches in resource-constrained embedded systems

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Karma: scalable deterministic record-replay

Proceedings of the international conference on Supercomputing
Deterministic replay for message-passing-based concurrent programs

ACM Transactions on Design Automation of Electronic Systems (TODAES) - Special section on verification challenges in the concurrent world
CCTR: An efficient point-to-point memory race recorder implemented in chunks

Microprocessors & Microsystems
An efficient deterministic record-replay with separate dependencies

Computers and Electrical Engineering

Abstract

Reliability is becoming an increasingly important issue in modern processor design. Smaller feature sizes and more numerous transistors are projected to increase the frequency of transient faults [4, 5]. Our project, ExtraVirt, leverages the trend toward multi-core and multi-processor systems to survive these transient faults. Our goals are (1) to add fault tolerance without modifying existing operating systems, applications, or hardware, (2) to minimize the time spent executing software that cannot tolerate faults, and (3) to minimize the time and space overhead needed to detect and recover from faults. We accomplish these goals by leveraging virtual-machine technology and by sharing memory and I/O devices across replicas. ExtraVirt extends prior work on VM-level fault tolerance [2] by detecting and recovering from non-fail-stop faults and by running multiple replicas efficiently on a single machine.
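
The abstract does not say how replica outputs are checked, so the following is only an illustrative sketch of one generic way replicated execution can expose a non-fail-stop fault: compare the outputs the replicas produce and take a majority vote. The function name `vote`, the use of byte strings to stand for replica outputs, and the voting policy itself are all assumptions made for this example, not details of ExtraVirt.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote(outputs: Sequence[bytes]) -> Tuple[Optional[bytes], List[int]]:
    """Compare replica outputs (hypothetical helper, not ExtraVirt's API).

    Returns (majority_output, suspect_replica_indices).
    majority_output is None when no value is produced by a strict
    majority of replicas: the mismatch is detected, but voting alone
    cannot say which replica is correct.
    """
    if not outputs:
        raise ValueError("need at least one replica output")

    # Most frequent output value and how many replicas produced it.
    value, count = Counter(outputs).most_common(1)[0]

    # No strict majority: every replica is suspect.
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))

    # Replicas that disagree with the majority are the suspects.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects
```

Under this policy, two replicas suffice to detect a divergence but not to attribute it, whereas three replicas allow a single divergent replica to be identified and outvoted.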