A dual process redundancy approach to transient fault tolerance for ccNUMA architecture

  • Authors:
  • Xingjun Zhang;Endong Wang;Feilong Tang;Meishun Yang;Hengyi Wei;Xiaoshe Dong

  • Venue:
  • Neurocomputing
  • Year:
  • 2013

Abstract

Transient faults are a critical concern for the reliability of microprocessor systems. Software fault tolerance is more flexible and less costly than hardware fault tolerance. In addition, as architectural trends point toward multicore designs, there is substantial interest in adapting parallel and redundant hardware resources for transient fault tolerance. This paper proposes a process-level fault tolerance technique, a software-centric approach that efficiently schedules and synchronizes redundant processes by exploiting the processor redundancy of ccNUMA architectures. The approach improves the running efficiency of the redundant processes and reduces their time and space overhead. The paper focuses on methods for detecting and handling errors in the redundant processes. A real prototype is implemented and is designed to be transparent to the application. Test results show that the system promptly detects CPU and memory soft errors that cause exceptions in the redundant processes, while ensuring that the application's services remain uninterrupted and are delayed only briefly.
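
The abstract does not describe the prototype's mechanism in detail, so the following is only a generic sketch of the dual process redundancy idea, not the authors' implementation. It assumes a POSIX system, runs the same task in two forked replicas, and accepts a result only when both replicas finish cleanly and agree. The names `start_replica`, `collect`, and `run_redundant` are hypothetical, and the retry-on-mismatch recovery policy is an assumption made for illustration.

```python
import os
import pickle


def start_replica(task, arg):
    """Fork a child that computes task(arg); return (pid, read_fd)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: compute, send the pickled result, and exit immediately.
        os.close(read_fd)
        try:
            payload = pickle.dumps(task(arg))
            with os.fdopen(write_fd, "wb") as writer:
                writer.write(payload)
            os._exit(0)
        except BaseException:
            # Any exception in the replica is reported as a failed exit.
            os._exit(1)
    # Parent: keep only the read end of the pipe.
    os.close(write_fd)
    return pid, read_fd


def collect(replica):
    """Wait for a replica; return (ok, result)."""
    pid, read_fd = replica
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()
    _, status = os.waitpid(pid, 0)
    clean_exit = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    if clean_exit and len(data) > 0:
        return True, pickle.loads(data)
    return False, None


def run_redundant(task, arg, max_retries=2):
    """Run two replicas concurrently and accept only matching results.

    Assumed policy: on a crash or a mismatch, rerun both replicas.
    """
    for _ in range(max_retries + 1):
        replicas = [start_replica(task, arg) for _ in range(2)]
        (ok_a, res_a), (ok_b, res_b) = [collect(r) for r in replicas]
        if ok_a and ok_b and res_a == res_b:
            return res_a
    raise RuntimeError("replicas failed or disagreed on every attempt")


if __name__ == "__main__":
    print(run_redundant(lambda x: x * x, 7))  # prints 49
```

Comparing two replicas can detect a divergence but cannot tell which replica is wrong, which is why this sketch reruns both rather than choosing one. On a ccNUMA machine the replicas could additionally be pinned to different nodes (for example with `os.sched_setaffinity` on Linux), which is omitted here.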