A self-checking hardware journal for a fault-tolerant processor architecture

Authors:
Mohsin Amin;Abbas Ramazani;Fabrice Monteiro;Camille Diou;Abbas Dandache
Affiliations:
LICM Laboratory, University Paul Verlaine, Metz, Metz, France;Electrical Engineering Department, Engineering Faculty, Lorestan University, Khorramabad, Iran;LICM Laboratory, University Paul Verlaine, Metz, Metz, France;LICM Laboratory, University Paul Verlaine, Metz, Metz, France;LICM Laboratory, University Paul Verlaine, Metz, Metz, France
Venue:
International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
Year:
2011


Abstract

We introduce a specialized self-checking hardware journal that serves as the centerpiece of our design strategy for building a processor tolerant to transient faults. Fault tolerance relies on error detection techniques in the processor core, combined with journalization and rollback execution to recover from errors. Effective rollback recovery is made possible by the hardware journal and by choosing a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of journalization and of the self-checking hardware journal is to prevent data that has not yet been validated from being sent to the main memory, and to allow fast rollback when a fault occurs. The main memory, assumed to be fault-secure in our model, contains only valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core, to detect errors, and in the hardware journal, to protect the temporarily stored data from changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family clearly show the relevance of the approach, in terms of both performance/area tradeoff and fault tolerance effectiveness, even at high error rates.
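
The journalization idea in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the class name `Journal`, its methods, and the dictionary-based memory are hypothetical, and a single parity bit stands in for the error control code, which the abstract does not specify. Writes are held in the journal until validated; validation commits them to the (assumed fault-secure) main memory, while a detected error or an explicit rollback discards them.

```python
class Journal:
    """Toy model of a write journal (hypothetical, for illustration only)."""

    def __init__(self, memory):
        self.memory = memory   # assumed fault-secure main memory: addr -> value
        self.pending = {}      # unvalidated writes: addr -> (value, parity bit)

    @staticmethod
    def parity(value):
        # Single even-parity bit; a stand-in for the real error control code.
        return bin(value).count("1") & 1

    def write(self, addr, value):
        # Unvalidated data goes to the journal, never directly to main memory.
        self.pending[addr] = (value, self.parity(value))

    def read(self, addr):
        # The most recent unvalidated write shadows the value in main memory.
        if addr in self.pending:
            value, stored_parity = self.pending[addr]
            if self.parity(value) != stored_parity:
                raise ValueError("journal entry corrupted")
            return value
        return self.memory[addr]

    def rollback(self):
        # Discard all unvalidated writes; main memory is left untouched.
        self.pending.clear()

    def validate(self):
        # Check every entry first; commit only if all of them are intact.
        for value, stored_parity in self.pending.values():
            if self.parity(value) != stored_parity:
                self.rollback()
                return False
        for addr, (value, _) in self.pending.items():
            self.memory[addr] = value
        self.pending.clear()
        return True
```

In this model, calling `validate()` at the end of a fault-free instruction sequence commits the buffered writes, while `rollback()` restores the view of memory as of the last validation, so re-execution can restart from a known-good state.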