Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Computer architecture (2nd ed.): a quantitative approach
Computer architecture (2nd ed.): a quantitative approach
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
IBM's S/390 G5 Microprocessor Design
IEEE Micro
Concurrent Error Detection Using Watchdog Processors-A Survey
IEEE Transactions on Computers
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
SMTp: An Architecture for Next-generation Scalable Multi-threading
Proceedings of the 31st annual international symposium on Computer architecture
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
Proceedings of the 31st annual international symposium on Computer architecture
Proceedings of the 31st annual international symposium on Computer architecture
Fingerprinting: bounding soft-error detection latency and bandwidth
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
SWIFT: Software Implemented Fault Tolerance
Proceedings of the international symposium on Code generation and optimization
Assessing Fault Sensitivity in MPI Applications
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Design and Evaluation of Hybrid Fault-Detection Systems
Proceedings of the 32nd annual international symposium on Computer Architecture
Opportunistic Transient-Fault Detection
Proceedings of the 32nd annual international symposium on Computer Architecture
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Autonomic Microprocessor Execution via Self-Repairing Arrays
IEEE Transactions on Dependable and Secure Computing
Performance/Watt: the new server focus
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
Self-checking instructions: reducing instruction redundancy for concurrent error detection
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
Static typing for a faulty lambda calculus
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
SlicK: slice-based locality exploitation for efficient redundant multithreading
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Reunion: Complexity-Effective Multicore Redundancy
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Configurable isolation: building high availability systems with commodity multi-core processors
Proceedings of the 34th annual international symposium on Computer architecture
Mechanisms for bounding vulnerabilities of processor structures
Proceedings of the 34th annual international symposium on Computer architecture
Dynamic prediction of architectural vulnerability from microarchitectural state
Proceedings of the 34th annual international symposium on Computer architecture
Online diagnosis of hard faults in microprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Fault-tolerant typed assembly language
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Proceedings of the International Symposium on Code Generation and Optimization
Modeling and improving data cache reliability: 1
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Hierarchical Verification for Increasing Performance in Reliable Processors
Journal of Electronic Testing: Theory and Applications
Communications of the ACM - Web science
Efficient fault tolerance in multi-media applications through selective instruction replication
Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Implementing high availability memory with a duplication cache
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Self-recovery in server programs
Proceedings of the 2009 international symposium on Memory management
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Compiler-assisted soft error detection under performance and energy constraints in embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Sequential element design with built-in soft error resilience
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
End-to-end register data-flow continuous self-test
Proceedings of the 36th annual international symposium on Computer architecture
Instruction-Level Fault Tolerance Configurability
Journal of Signal Processing Systems
AN-Encoding Compiler: Building Safety-Critical Systems with Commodity Hardware
SAFECOMP '09 Proceedings of the 28th International Conference on Computer Safety, Reliability, and Security
REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Architecture Design for Soft Errors
Architecture Design for Soft Errors
Selective replication: A lightweight technique for soft errors
ACM Transactions on Computer Systems (TOCS)
Improving chip multiprocessor reliability through code replication
Computers and Electrical Engineering
A compiler-based infrastructure for fault-tolerant co-design
Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems
Modeling soft errors for data caches and alleviating their effects on data reliability
Microprocessors & Microsystems
Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors
Proceedings of the Conference on Design, Automation and Test in Europe
A unified online fault detection scheme via checking of stability violation
Proceedings of the Conference on Design, Automation and Test in Europe
On the exploitation of narrow-width values for improving register file reliability
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems
EWDC '11 Proceedings of the 13th European Workshop on Dependable Computing
A fault-tolerant, dynamically scheduled pipeline structure for chip multiprocessors
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
A self-checking hardware journal for a fault-tolerant processor architecture
International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
Trade-offs in transient fault recovery schemes for redundant multithreaded processors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Exploiting inactive rename slots for detecting soft errors
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Runtime asynchronous fault tolerance via speculation
Proceedings of the Tenth International Symposium on Code Generation and Optimization
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Proceedings of the ACM International Conference on Computing Frontiers
FaulTM: error detection and recovery using hardware transactional memory
Proceedings of the Conference on Design, Automation and Test in Europe
IVF: characterizing the vulnerability of microprocessor structures to intermittent faults
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
A survey of checker architectures
ACM Computing Surveys (CSUR)
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Hi-index | 0.00 |
We propose a scheme for transient-fault recovery called Simultaneously and Redundantly Threaded processors with Recovery (SRTR) that enhances a previously proposed scheme for transient-fault detection, called Simultaneously and Redundantly Threaded (SRT) processors. SRT replicates an application into two communicating threads, one executing ahead of the other. The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared. In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection. In contrast, SRTR must not allow any leading instruction to commit before checking occurs, since a faulty instruction cannot be undone once the instruction commits.To avoid stalling leading instructions at commit while waiting for their trailing counterparts, SRTR exploits the time between the completion and commit of leading instructions. SRTR compares the leading and trailing values as soon as the trailing instruction completes, typically before the leading instruction reaches the commit point. To avoid increasing the bandwidth demand on the register file for checking register values, SRTR uses the register value queue (RVQ) to hold register values for checking. To reduce the bandwidth pressure on the RVQ itself, SRTR employs dependence-based checking elision (DBCE). By reasoning that faults propagate through dependent instructions, DBCE exploits register (true) dependence chains so that only the last instruction in a chain uses the RVQ, and has the leading and trailing values checked. SRTR performs within 1% and 7% of SRT for SPEC95 integer and floating-point programs, respectively: While SRTR without DBCE incurs about 18% performance loss when the number of RVQ ports is reduced from four (which is performance-equivalent to an unlimited number) to two ports, with DBCE, a two-ported RVQ performs within 2% of a four-ported RVQ.