Transient Fault Tolerance for ccNUMA Architecture

  • Authors:
  • Xingjun Zhang;Endong Wang;Feilong Tang;Meishun Yang;Hengyi Wei;Xiaoshe Dong

  • Affiliations:
  • -;-;-;-;-;-

  • Venue:
  • IMIS '12 Proceedings of the 2012 Sixth International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Transient fault is a critical concern in the reliability of microprocessors system. The software fault tolerance is more flexible and lower cost than the hardware fault tolerance. And also, as architectural trends point toward multi core designs, there is substantial interest in adapting parallel and redundancy hardware resources for transient fault tolerance. The paper proposes a process-level fault tolerance technique, a software centric approach, which efficiently schedule and synchronize of redundancy processes with ccNUMA processors redundancy. So it can improve efficiency of redundancy processes running, and reduce time and space overhead. The paper focuses on the researching of redundancy processes error detection and handling method. A real prototype is implemented that is designed to be transparent to the application. The test results show that the system can timely detect soft errors of CPU and memory that cause the redundancy processes exception, and meanwhile ensure that the services of application is uninterrupted and delay shortly.