A Fault-Tolerant Scheduling Algorithm for Real-Time Periodic Tasks with Possible Software Faults

  • Authors:
  • Ching-Chih Han;Kang G. Shin;Jian Wu

  • Affiliations:
  • -;-;-

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2003

Quantified Score

Hi-index 14.98

Visualization

Abstract

A hard real-time system is usually subject to stringent reliability and timing constraints since failure to produce correct results in a timely manner may lead to a disaster. One way to avoid missing deadlines is to trade the quality of computation results for timeliness and software fault tolerance is often achieved with the use of redundant programs. A deadline mechanism which combines these two methods is proposed to provide software fault tolerance in hard real-time periodic task systems. Specifically, we consider the problem of scheduling a set of real-time periodic tasks each of which has two versions: primary and alternate. The primary version contains more functions (thus more complex) and produces good quality results, but its correctness is more difficult to verify because of its high level of complexity and resource usage. By contrast, the alternate version contains only the minimum required functions (thus simpler) and produces less precise, but acceptable results and its correctness is easy to verify. We propose a scheduling algorithm which 1) guarantees either the primary or alternate version of each critical task to be completed in time and 2) attempts to complete as many primaries as possible. Our basic algorithm uses a fixed priority-driven preemptive scheduling scheme to preallocate time intervals to the alternates and, at runtime, attempts to execute primaries first. An alternate will be executed only 1) if its primary fails due to lack of time or manifestation of bugs or 2) when the latest time to start execution of the alternate without missing the corresponding task deadline is reached. This algorithm is shown to be effective and easy to implement. This algorithm is enhanced further to prevent early failures in executing primaries from triggering failures in the subsequent job executions, thus improving efficiency of processor usage.