Reliability has become a first-class design consideration for architects, alongside performance and energy efficiency. Continued technology scaling and the accompanying supply-voltage reductions, together with temperature fluctuations, increase the susceptibility of architectures to errors. Previous approaches have tried to provide fault tolerance, but they usually suffer from critical drawbacks in either hardware duplication or performance degradation, which are unacceptable for the majority of common users. RMT (Redundant Multi-Threading) is a family of techniques based on SMT processors in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions in order to detect faults by comparing their outputs. In this paper, we study the under-explored architectural support that RMT techniques need to reliably execute shared-memory applications. We show how atomic operations introduce serialization points between the master and slave threads. This bottleneck has a 34% impact on execution time for several parallel scientific benchmarks. To address this issue, we present REPAS (Reliable execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism that provides reliable execution for shared-memory applications. While previous proposals achieve the same goal using a large amount of hardware (usually twice the number of cores in the system), the REPAS architecture needs only a small amount of extra hardware, since the redundant execution takes place within 2-way SMT cores in which the majority of the hardware is shared. Our evaluation shows that REPAS provides full coverage against soft errors with a lower performance slowdown, relative to a non-redundant system, than previous proposals, while also using fewer hardware resources.
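The detection principle behind RMT (two copies of the same computation, fed the same inputs, with their outputs compared) can be illustrated with a small software analogy. The sketch below is not the REPAS hardware mechanism; the names `flip_bit` and `redundant_execute` and the single-bit-flip fault model are assumptions made only for illustration.

```python
from typing import Callable, Optional, Tuple


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault as a single bit flip in an integer result."""
    return value ^ (1 << bit)


def redundant_execute(
    op: Callable[[int, int], int],
    a: int,
    b: int,
    fault_bit: Optional[int] = None,
) -> Tuple[int, bool]:
    """Run `op` twice on the same inputs and compare the two outputs.

    Returns (master_result, fault_detected). `fault_bit` is a hypothetical
    knob that injects a bit flip into the slave copy's result.
    """
    master = op(a, b)  # first (master) execution
    slave = op(a, b)   # redundant (slave) execution on the same inputs
    if fault_bit is not None:
        slave = flip_bit(slave, fault_bit)  # injected soft error (assumption)
    # A mismatch between the two outputs signals a fault.
    return master, master != slave


def add(x: int, y: int) -> int:
    return x + y


clean = redundant_execute(add, 2, 3)
faulty = redundant_execute(add, 2, 3, fault_bit=0)
print(clean, faulty)
```

In this toy model a mismatch only detects the fault; it does not say which copy is wrong, so recovery would need an additional mechanism.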