Exploiting replicated checkpoints for soft error detection and correction

Authors:
Fahrettin Koc;Kenan Bozdas;Burak Karsli;Oguz Ergin
Affiliations:
TOBB University of Economics and Technology, Ankara, Turkey;TOBB University of Economics and Technology, Ankara, Turkey;TOBB University of Economics and Technology, Ankara, Turkey;TOBB University of Economics and Technology, Ankara, Turkey
Venue:
Proceedings of the Conference on Design, Automation and Test in Europe
Year:
2013

Abstract

Register renaming is widely used in contemporary superscalar microprocessors to remove false dependencies. A register alias table (RAT) holds the current locations of the values that correspond to the architectural registers. Some recent processors copy the rename table at each branch instruction so that its contents can be recovered when a misspeculation occurs. In this paper, we first investigate the vulnerability of the RAT to transient errors. We then analyze the vulnerability of RAT checkpoints and propose two techniques for soft error detection and correction that use the redundant copies of entries whose content is the same as in the previous and/or next checkpoint. Simulation results for the SPEC 2006 benchmarks show that, on average, RAT vulnerability is 25% and checkpoint vulnerability is 6%. The results also show that redundancy exists across sequential checkpoint copies and can be used for error detection and correction. The proposed techniques exploit this redundancy: faults can be detected in 41% of all checkpoints and 44% of rolled-back checkpoints, and errors can be corrected in 33% of rolled-back checkpoints. Because the techniques exploit storage that is already available, they can be implemented with minimal hardware overhead.
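
The idea of reusing neighboring checkpoints as redundant copies can be illustrated with a small software sketch. This is not the authors' hardware design; the abstract does not describe the mechanism. The sketch assumes that, for a given RAT entry, the caller already knows which neighboring checkpoints are supposed to hold the same mapping (for example through a hypothetical, fault-free per-entry "unchanged" flag) and passes only those copies in. The function name `check_entry` and the status strings are illustrative. With one redundant copy a mismatch can only be detected; with two, a majority vote can also correct the entry.

```python
from typing import Optional, Tuple


def check_entry(curr: int,
                prev: Optional[int] = None,
                nxt: Optional[int] = None) -> Tuple[str, int]:
    """Check one checkpointed RAT entry against redundant neighbor copies.

    curr: the entry's value in the checkpoint being restored.
    prev/nxt: the same entry in the previous/next checkpoint, passed only
              when that copy is assumed to hold the same mapping
              (None means no redundant copy is available on that side).
    Returns (status, value), where value is the mapping to restore.
    """
    copies = [c for c in (prev, nxt) if c is not None]

    if not copies:
        # No redundancy available: a fault here would go unnoticed.
        return "unprotected", curr

    if len(copies) == 1:
        # Two copies in total: a mismatch reveals a fault,
        # but there is no way to tell which copy is wrong.
        return ("ok", curr) if copies[0] == curr else ("detected", curr)

    # Three copies in total: majority vote.
    if prev == nxt:
        return ("ok", curr) if curr == prev else ("corrected", prev)
    if curr == prev or curr == nxt:
        # curr agrees with one neighbor, so the other neighbor is the odd one out.
        return "ok", curr
    # All three disagree: the fault is visible but not correctable.
    return "detected", curr
```

For example, `check_entry(7, prev=5, nxt=5)` returns `("corrected", 5)`, whereas `check_entry(7, prev=5)` can only return `("detected", 7)`.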