Generalized Algorithm-Based Fault Tolerance: Error Correction via Kalman Estimation

  • Authors: G. Robert Redinbo
  • Affiliations: Univ. of California, Davis
  • Venue: IEEE Transactions on Computers
  • Year: 1998

Abstract

An extension to Algorithm-Based Fault Tolerance (ABFT) methodologies shows how parity values dictated by a real convolutional code can be employed by Kalman estimation techniques to perform real-number correction for protecting linear processing systems. Intermittent failures appearing in the output samples are detected and corrected using only the syndromes normally generated in ABFT schemes. The algebraic structure of a real convolutional code provides the separation needed by recursive Kalman state estimators to effect mean-square error correction. State and parity measurement equations model faults and computational noise in both the linear processing and parity generation subassemblies, and, in a departure from previous models, the noise sources are considered time-varying. The Kalman one-step estimator, which bases its decisions on all parity values up to the present point, is determined; it separates naturally into detection and correction operations, permitting corrective action only when the detection levels exceed thresholds based on roundoff noise energy. The detector/corrector uses efficient multirate block processing techniques as determined by the real convolutional code. A smoothed fixed-lag Kalman estimator, which uses parity values for a fixed lag beyond the point of interest, is needed to complete the correction. It reuses one-step estimator quantities, and implementation simplifications are possible. Examples showing the correction behavior and mean-square error performance are presented, and the size of overhead calculations for detection and correction is estimated. A protected processing system is constructed by introducing additional subassemblies, mostly comparators, alongside the detection and correction parts already described. Under the usual assumption of at most a single subassembly failure, no improperly detected or corrected data leave the overall protected configuration.
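
The detect-then-correct idea in the abstract can be illustrated with a deliberately simplified sketch. It is not the paper's estimator: it assumes a scalar, memoryless model in which each syndrome is `s = h*e + v`, where `e` is the output error with assumed variance `fault_var` and `v` is roundoff noise with assumed variance `noise_var`. Under those assumptions the Kalman measurement update reduces to a single gain, and correction is applied only when the syndrome exceeds a threshold tied to the roundoff noise level. The function name, parameters, and the `k_sigma` threshold rule are all hypothetical choices made for illustration.

```python
def gated_correction(outputs, syndromes, h=1.0, fault_var=1.0,
                     noise_var=1e-12, k_sigma=5.0):
    """Correct outputs from syndromes, acting only above a noise threshold.

    Assumed model (illustrative only): s = h*e + v with Var(e) = fault_var
    and Var(v) = noise_var, independent from sample to sample.
    """
    # Mean-square-optimal (Kalman measurement-update) gain for the scalar model.
    gain = fault_var * h / (h * h * fault_var + noise_var)
    # Detection threshold: a multiple of the roundoff noise standard deviation.
    threshold = k_sigma * noise_var ** 0.5

    corrected, flags = [], []
    for y, s in zip(outputs, syndromes):
        if abs(s) > threshold:
            # Detection fired: estimate the error and subtract it.
            corrected.append(y - gain * s)
            flags.append(True)
        else:
            # Syndrome is consistent with roundoff noise: leave the output alone.
            corrected.append(y)
            flags.append(False)
    return corrected, flags


# Hypothetical data: the true outputs are 1.0, 2.0, 3.0 and the second
# sample carries a fault of 0.5; the other syndromes are roundoff-sized.
observed = [1.0, 2.5, 3.0]
syndromes = [1e-7, 0.5, -2e-7]
corrected, flags = gated_correction(observed, syndromes)
```

Gating on the syndrome magnitude keeps fault-free samples untouched, so roundoff noise in the parity path is not fed back into good data. The recursive one-step and fixed-lag estimators described in the abstract additionally carry state across samples, which this sketch omits.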