Stack computers: the new wave
Virtual Checkpoints: Architecture and Performance
IEEE Transactions on Computers - Special issue on fault-tolerant computing
Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
IBM experiments in soft fails in computer electronics (1978–1994)
IBM Journal of Research and Development - Special issue: terrestrial cosmic rays and soft errors
Fault-tolerant computer system design
Fault-tolerant computer system design
Heterogeneous Simulation—Mixing Discrete-Event Models with Dataflow
Journal of VLSI Signal Processing Systems - Special issue on the rapid prototyping of application specific signal processors (RASSP) program
IEEE Transactions on Computers
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems
IEEE Transactions on Computers
Analysis of Checkpointing for Real-Time Systems
Real-Time Systems
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Computer
IBM's S/390 G5 Microprocessor Design
IEEE Micro
Fault Injection and Dependability Evaluation of Fault-Tolerant Systems
IEEE Transactions on Computers
Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
The architecture of Tandem's NonStop system
ACM '81 Proceedings of the ACM '81 conference
RTCSA '99 Proceedings of the Sixth International Conference on Real-Time Computing Systems and Applications
Roll-forward error recovery in embedded real-time systems
ICPADS '96 Proceedings of the 1996 International Conference on Parallel and Distributed Systems
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Reliability-Aware Co-Synthesis for Embedded Systems
ASAP '04 Proceedings of the Application-Specific Systems, Architectures and Processors, 15th IEEE International Conference
Basic Concepts and Taxonomy of Dependable and Secure Computing
IEEE Transactions on Dependable and Secure Computing
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
DSN '07 Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
A class of optimal minimum odd-weight-column SEC-DED codes
IBM Journal of Research and Development
Error-correcting codes for semiconductor memory applications: a state-of-the-art review
IBM Journal of Research and Development
Proceedings of the Conference on Design, Automation and Test in Europe
Transparent recovery from intermittent faults in time-triggered distributed systems
IEEE Transactions on Computers
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Hi-index | 0.00 |
We introduce a specialized self-checking hardware journal being used as a centerpiece in our design strategy to build a processor tolerant to transient faults. Fault tolerance here relies on the use of error detection techniques in the processor core together with journalization and rollback execution to recover from erroneous situations. Effective rollback recovery is possible thanks to using a hardware journal and chosing a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of the journalization and the hardware self-checking journal is to prevent data not yet validated to be sent to the main memory, and allow to fast rollback execution on faulty situations. The main memory, supposed to be fault secure in our model, only contains valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core to detect errors and in the HW journal to protect the temporarily stored data from possible changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family show clearly the relevance of the approach, both in terms of performance/area tradeoff and fault tolerance effectiveness, even for high error rates.