On-line error detection and fast recover techniques for dependable embedded processors

Authors:
Matthias Pflanz
Affiliations:
IBM Deutschland Entwicklung GmbH, Department of Processor Development II, Böblingen, Germany
Venue:
On-line error detection and fast recover techniques for dependable embedded processors
Year:
2002

Abstract

This thesis summarizes investigations and experiments on on-line observation and concurrent checking of processors. The objective was to detect single and/or multiple errors within one clock cycle. First, refined techniques for data-path observation were investigated. Starting from an approach for observing an ALU by Berger code prediction (BCP), the principle was extended to observe complete data-path structures and to detect unidirectional errors. The applicability of BCP to more complex data-paths with floating-point units was shown using single- and double-precision floating-point addition/subtraction units. To this end, prediction formulas were developed that account for the operation of multi-stage units. The cross-parity observation technique was developed especially for the on-line observation of large register-files or control-registers. By checking row, column, and diagonal parities, single and multiple register errors can be detected. Cross-parity vectors also have a potential diagnosis capability. Because the processor control-logic is critical, different techniques were developed and investigated to detect single or multiple control-signal errors within the clock cycle in which they occur. As a simple alternative to a fault-secure controller, a duplicated control-logic was implemented. The identification of control-word differences can be used for error-weighting a subsequent control and, finally, for further recovery strategies. As a practical solution for small processors, a triplicated structure was investigated. It allows a fault-tolerant generation of control-signals that compensates for transient errors until the first permanent error occurs. An application-driven reduction (ADR) of control-logic was proposed to decrease the overhead, especially for embedded systems with standard CISCs and a limited number of applications. To detect control-signal errors, a new approach based on the processor state machine was taken.
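The two data-path checks named above can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the hardware scheme of the thesis: the function names, the 4-bit register width, and the diagonal mapping `(r + c) % width` are all hypothetical choices made for illustration. A Berger check symbol is the count of 0 bits in a word, which is why any unidirectional error changes it. Cross-parity keeps one parity bit per row (register), per column (bit position), and per diagonal.

```python
def berger_check(word, width):
    """Berger check symbol: number of 0 bits among the `width` data bits."""
    mask = (1 << width) - 1
    return width - bin(word & mask).count("1")


def cross_parity(regs, width):
    """Return (row, column, diagonal) parity vectors of a register file.

    Assumed layout: bit c of register r lies on diagonal (r + c) % width.
    """
    mask = (1 << width) - 1
    rows = [bin(value & mask).count("1") & 1 for value in regs]
    cols = [0] * width
    diags = [0] * width
    for r, value in enumerate(regs):
        for c in range(width):
            bit = (value >> c) & 1
            cols[c] ^= bit
            diags[(r + c) % width] ^= bit
    return rows, cols, diags


def locate_single_error(reference, observed):
    """Return (register, bit) if exactly one row and one column parity differ."""
    bad_rows = [i for i, (a, b) in enumerate(zip(reference[0], observed[0])) if a != b]
    bad_cols = [i for i, (a, b) in enumerate(zip(reference[1], observed[1])) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


# Usage: flip bit 2 of register 1 and diagnose the position from the parities.
regs = [0x3, 0x5, 0xF, 0x0]
reference = cross_parity(regs, 4)
regs[1] ^= 0b0100
observed = cross_parity(regs, 4)
error_position = locate_single_error(reference, observed)
```

In this model a double error inside one register leaves the row parity unchanged but still disturbs two column parities, so it is detected even though it cannot be located by the simple diagnosis function.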
To cope with the complexity of the state-spaces of common microprocessors, active control-signals were considered as a definitive representation of the current processor activity, that is, the processor state. Provided that all control-signals are accessible and transitions are neglected, a combinatorial observation can be realized. Control-signals were encoded to a state-code, which represents the current (legal) state of the processor. With access to the control-signal conditions (instruction, time, flag-variables), a controller-independent generation (prediction) of the same code was realized. A comparison of both identifies an illegal state-code. To manage more complex state machines, an application-driven reduced state-encoder or a state-space partitioning was proposed. For pipeline structures, a partitioned observation of states was implemented as an example. Since errors can be detected within the same clock cycle, fast recovery techniques for the processor state were investigated. Starting from the optimistic assumption that an error is transient, a fast repetition (rollback) of the erroneous cycle(s) can deliver correct results. Time-intervals of many thousands of cycles in classical rollback techniques cannot satisfy the demands of safety-critical applications. Therefore, a shorter recovery time (rollback distance) was implemented by micro-rollback strategies. Recent approaches to micro-rollback can recover the corresponding structure in case of a transient error, but this technique fails in the case of permanent errors. Therefore, a double-processor architecture was investigated. The master-trailer structure turns out to be a suitable solution for small processors. The trailer is delayed by one cycle. With this delay and an on-line checked master, a fast repair (2 cycles) of transient errors can be executed by restoring all master-registers from their counterparts in the trailer.
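The micro-rollback idea described above can be sketched in software as a register file that keeps a short history of cycle-start snapshots. This is a minimal illustration, not the thesis implementation: the class `MicroRollbackRegs`, the helper `run`, and the way a detected transient fault is modeled (a set of attempt indices at which the checker is assumed to fire) are hypothetical.

```python
from collections import deque


class MicroRollbackRegs:
    """Register file that keeps the last `depth` cycle-start snapshots."""

    def __init__(self, n_regs, depth):
        self.regs = [0] * n_regs
        self.history = deque(maxlen=depth)

    def begin_cycle(self):
        # Save the state as it was before this cycle modifies it.
        self.history.append(list(self.regs))

    def rollback(self, distance=1):
        # Restore the state from the start of the cycle `distance` cycles back.
        if not 1 <= distance <= len(self.history):
            raise ValueError("rollback distance exceeds stored history")
        for _ in range(distance - 1):
            self.history.pop()
        self.regs = self.history.pop()


def run(program, state, transient_faults=()):
    """Execute `program`, repeating any cycle in which an error is detected.

    `transient_faults` holds the attempt indices at which a transient error
    corrupts the result and the on-line checker (assumed) reports it.
    Returns the total number of cycles spent, including repeated ones.
    """
    faults = set(transient_faults)
    pc = 0
    attempts = 0
    while pc < len(program):
        state.begin_cycle()
        program[pc](state.regs)
        if attempts in faults:
            state.regs[0] ^= 0xFF   # model the corrupted result
            state.rollback(1)       # discard it and repeat this cycle
        else:
            pc += 1
        attempts += 1
    return attempts


def inc(regs):
    regs[0] += 1


def dbl(regs):
    regs[1] = regs[0] * 2


# Usage: one transient fault during the second cycle costs one extra cycle.
state = MicroRollbackRegs(n_regs=2, depth=4)
cycles = run([inc, inc, dbl], state, transient_faults={1})
```

Because the repeated cycle re-executes the same operation, this scheme only helps when the error does not recur, which matches the limitation for permanent errors noted above.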
The advantage is the function takeover (3 cycles) in the case of a permanent error. For pipeline processors, a further-developed rollback technique considers dynamic execution lengths of different stages on the one hand and different error weightings on the other. Therefore, a priority control was proposed to manage different rollback actions (necessary rollback distances) for the recovery of the pipeline. The possible actions are a one-cycle micro-rollback, a pipeline stage-rollback, and a macro-rollback that refills the whole pipeline. In the worst case (all stored return points lost), the program is re-executed. The proposed on-line error detection and fast recovery techniques should be a supplement to other methods. In combination with other on-line observation principles, and/or with a combined hardware-software (self-)test, these techniques are used to realize a complete self-check scheme for an embedded processor. Strategies for a static or dynamic (micro-)rollback are a useful solution for processor errors caused by non-recurring transient faults. An executed program can then be continued as quickly as the implemented structure allows. The overall approach for efficient on-line checking and fast recovery enhances processor availability and improves the dependability of an embedded system at very reasonable additional cost.
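The priority control over rollback actions can be pictured as choosing, among all actions requested in a cycle, the one with the largest rollback distance. The sketch below assumes exactly that policy; the enum `Rollback`, the function `select_action`, and the rule that missing return points force a re-execution are illustrative assumptions derived from the description above, not the thesis design.

```python
from enum import IntEnum


class Rollback(IntEnum):
    """Recovery actions ordered by increasing rollback distance."""
    MICRO = 1       # repeat a single cycle
    STAGE = 2       # repeat one pipeline stage
    MACRO = 3       # refill the whole pipeline
    REEXECUTE = 4   # restart the program


def select_action(requests, return_points_available=True):
    """Pick the recovery action for the current cycle.

    `requests` is an iterable of Rollback values raised by the checkers.
    Returns None when no error was reported.
    """
    requests = list(requests)
    if not requests:
        return None
    if not return_points_available:
        # Worst case: no stored return point is left to roll back to.
        return Rollback.REEXECUTE
    # Assumed policy: the largest required rollback distance wins.
    return max(requests)


# Usage: a stage-level error outranks a simultaneous single-cycle error.
chosen = select_action([Rollback.MICRO, Rollback.STAGE])
```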