Fault detection in multi-core processors using chaotic maps

  • Authors: Nageswara S.V. Rao
  • Affiliations: Oak Ridge National Laboratory, Oak Ridge, TN, USA
  • Venue: Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
  • Year: 2013

Abstract

Exascale systems built using multi-core processors are expected to experience several component faults during code executions that last for hours. It is important to detect faults in processor cores so that faulty cores can be removed from scheduler pools, nodes with high failure rates can be swapped out, applications can be migrated, and checkpoint recoveries can be initiated. We propose lightweight codes that use chaotic computations and customized threads to detect component faults in multi-core processors. These codes concurrently execute dedicated threads that implement Poincaré and identity maps, which are customized to isolate faults in arithmetic operations, memory elements, and interconnects. Instruction execution errors and local memory errors are detected by threads dedicated to individual processor cores, and errors in inter-processor cross-connects are detected by movements of data between global and local memory. We present preliminary implementation results on 4- and 48-core HP workstations under simulated faults.
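
The detection idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. As assumptions, it uses the logistic map as a stand-in chaotic map (the paper uses Poincaré maps), runs in a single thread rather than one thread per core, and injects faults artificially. All function names and parameters (`logistic`, `trajectory`, `first_divergence`, `identity_map_errors`, `fault_step`, `fault_size`, `corrupt_index`) are hypothetical. The point it shows is that a chaotic map amplifies a tiny arithmetic error until a simple comparison against a reference trajectory exposes it, while an identity map (write a pattern, read it back) exposes corrupted memory directly.

```python
def logistic(x, r=4.0):
    """One step of the logistic map, chaotic for r = 4 (assumed stand-in map)."""
    return r * x * (1.0 - x)


def trajectory(x0, steps, fault_step=None, fault_size=0.0):
    """Iterate the map; optionally add a simulated arithmetic error at one step."""
    x = x0
    states = []
    for i in range(steps):
        x = logistic(x)
        if i == fault_step:
            x += fault_size  # simulated fault, not a real hardware error
        states.append(x)
    return states


def first_divergence(reference, observed, tol=1e-6):
    """Return the first index where the trajectories differ by more than tol."""
    for i, (a, b) in enumerate(zip(reference, observed)):
        if abs(a - b) > tol:
            return i
    return None  # no fault detected


def identity_map_errors(pattern, corrupt_index=None):
    """Identity map check: store a pattern, read it back, list mismatched bytes."""
    stored = bytearray(pattern)  # "write" to the memory under test
    if corrupt_index is not None:
        stored[corrupt_index] ^= 0x01  # simulated single-bit memory fault
    return [i for i, (a, b) in enumerate(zip(pattern, stored)) if a != b]


if __name__ == "__main__":
    reference = trajectory(0.3, 100)
    faulty = trajectory(0.3, 100, fault_step=10, fault_size=1e-12)
    print("arithmetic fault visible at step:", first_divergence(reference, faulty))
    print("memory mismatches:", identity_map_errors(bytes(range(16)), corrupt_index=5))
```

In this sketch the injected error is far below the comparison tolerance when it occurs, so it only becomes visible some iterations later, after the chaotic dynamics have amplified it. In a multi-core setting, one would presumably run such a trajectory in a thread pinned to each core and compare the results across cores or against a precomputed reference.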