A Flexible Approach to Improving System Reliability with Virtual Lockstep

  • Authors: Casey M. Jeffery; Renato J. O. Figueiredo
  • Affiliations: University of Florida, Gainesville (both authors)
  • Venue: IEEE Transactions on Dependable and Secure Computing
  • Year: 2012

Abstract

There is an increasing need for fault tolerance capabilities in logic devices brought about by the scaling of transistors to ever smaller geometries. This paper presents a hypervisor-based replication approach that can be applied to commodity hardware to allow for virtually lockstepped execution. It offers many of the benefits of hardware-based lockstep while being cheaper and easier to implement and more flexible in the configurations supported. A novel form of processor state fingerprinting is also presented, which can significantly reduce the fault detection latency. This further improves reliability by triggering rollback recovery before errors are recorded to a checkpoint. The mechanisms are validated using a full prototype and the benchmarks considered indicate an average performance overhead of approximately 14 percent with the possibility for significant optimization. Finally, a unique method of using virtual lockstep for fault injection testing is presented and used to show that significant detection latency reduction is achievable by comparing only a small amount of data across replicas.
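The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `CpuState` fields, the truncated SHA-256 digest, and the function names `fingerprint`, `first_divergence`, and `commit_or_rollback` are all assumptions chosen for illustration. The sketch only shows the general idea that replicas exchange a small digest of processor state at each comparison point, and that a mismatch triggers a rollback instead of a checkpoint commit.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

MASK64 = 0xFFFFFFFFFFFFFFFF  # treat every value as a 64-bit word (assumption)


@dataclass(frozen=True)
class CpuState:
    """Hypothetical snapshot of replica state at one comparison point."""
    instruction_pointer: int
    registers: Tuple[int, ...]


def fingerprint(state: CpuState) -> bytes:
    """Compress the state into 8 bytes so little data crosses between replicas.

    Truncated SHA-256 is an illustrative choice, not the paper's hash.
    """
    h = hashlib.sha256()
    h.update((state.instruction_pointer & MASK64).to_bytes(8, "little"))
    for reg in state.registers:
        h.update((reg & MASK64).to_bytes(8, "little"))
    return h.digest()[:8]


def first_divergence(primary: Sequence[CpuState],
                     backup: Sequence[CpuState]) -> Optional[int]:
    """Return the index of the first comparison point whose fingerprints differ."""
    for i, (p, b) in enumerate(zip(primary, backup)):
        if fingerprint(p) != fingerprint(b):
            return i
    return None


def commit_or_rollback(primary: Sequence[CpuState],
                       backup: Sequence[CpuState],
                       checkpoint: CpuState) -> Tuple[CpuState, str]:
    """Commit the newest state only if every fingerprint in the epoch matched."""
    if first_divergence(primary, backup) is None:
        newest = primary[-1] if primary else checkpoint
        return newest, "commit"
    # A mismatch was seen before the checkpoint was written, so the
    # last known-good checkpoint is kept and execution resumes from it.
    return checkpoint, "rollback"
```

In this sketch, comparing at more points within an epoch shortens the window in which a fault can go unnoticed, which is the detection latency effect the abstract refers to; how often the real system compares, and which state it hashes, is not specified here.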