On-line error detection and fast recover techniques for dependable embedded processors

  • Authors:
  • Matthias Pflanz

  • Affiliations:
  • IBM Deutschland Entwicklung GmbH, Department of Processor Development II, Böblingen, Germany

  • Venue:
  • On-line error detection and fast recover techniques for dependable embedded processors
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

This thesis summarizes investigations and experiments on on-line observation and concurrent checking of processors. The objective was to detect single and/or multiple errors within one clock cycle. First, refined techniques for data-path observation were investigated. Based on an approach for an observation of an ALU by Berger code prediction (BCP), the principle was extended to observe complete data-path structures to detect unidirectional errors. The applicability of BCP to more complex data-paths with floating-point units was shown with the help of single and double-precision addition/subtraction floatingpoint-units. Therefore, prediction formulas were developed, which consider the operation in multi-stage units. The cross-parity observation technique was developed especially for the on-line observation of large register-files or control-registers. By checking row, column, and diagonal-parities, single and multiple register errors can be detected. Cross-parity vectors have a potential diagnosis capability. Due to the critical character of the processor control-logic, different techniques were developed and investigated to detect single or multiple control-signal errors within the clock-cycle of occurrence. As a simple alternative for a fault-secure controller, a duplicated control-logic was implemented. The identification of control-word differences can be used for error-weighting a subsequent control and, finally, for further recovery strategies. As a practical solution for small processors, a triplicated structure was investigated. With it, a fault-tolerant generation of control-signals to compensate transient errors until the first permanent error was possible. An application-driven reduction (ADR) of control-logic was proposed to decrease the overhead, especially for embedded systems with standard CISCs and a limited number of applications. To detect control-signal errors, a new approach was taken by the processor state machine. To solve the problem with the complexity of state-spaces of common microprocessors, active control-signals were considered as a definitive representation of a current processor activity - the processor state. Access to all control-signals being assured and transitions being neglected, a combinatorial observation could be realized. Control-signals were encoded to a state-code, which represents the current (legal) state of the processor. With an access to control-signal conditions (instruction, time, flag-variables), a controller-independent generation (prediction) of the same code was realized. A comparison of both identifies an illegal state-code. To manage more complex state machines, an application-driven reduced state-encoder or a state-space partitioning was proposed. For pipeline structures, a partitioned observation of states was implemented as an example. As a consequence of a successful error detection within the same clock cycle, fast recovery techniques of the processor state were investigated. Starting from the positive oriented assumption that an error has a transient character, a fast repetition (rollback) of erroneous cycle(s) can deliver correct results. Time-intervals of many thousands of cycles in classical roll-back techniques can not satisfy demands for safety-critical applications. Therefore, a shorter time (rollback distance) for recovery was implemented by micro-rollback strategies. Recent approaches to micro-rollback can recover the corresponding structure in case of a transient error. But this technique fails in the case of permanent errors. Therefore, a double-processor architecture was investigated. The master-trailer structure turns out to be a suitable solution for small processors. The trailer is delayed for one cycle. With this plus on-line checked master, a fast repair (2 cycles) of transient errors can be executed by a backup of all master-registers by their counterparts in the trailer. The advantage is the function-takeover (3 cycles) in the case of a permanent-error occurrence. For pipeline processors, a further-developed rollback technique considers on one hand dynamical execution lengths for different stages, and on the other hand different error weightings. Therefore, a priority control was proposed to manage different rollbackactions (necessary rollback distances) for the recovery of the pipeline. Possible are one-cycle micro-rollback, a pipeline stage-rollback, and a macro-rollback by refilling the whole pipeline. In the worst case (lost all stored return points), a program reexecution is realized. Proposed on-line error detection and fast recovery techniques should be a supplement to other methods. In combination with other on-line observation principles, and/or with a combined hardware-software (self-)test, these techniques are used to fulfill a complete self-check scheme for an embedded processor. Strategies for a static or dynamic (micro-) rollback are a useful solution for processor errors due to transient faults of non-recurring characteristics. Then an executed program can be continued as quickly as the implemented structure allows. The overall approach for efficient on-line checking and fast recovery techniques enhances processor availability and improves the dependability of an embedded system at very reasonable additional costs.