Implementing Precise Interrupts in Pipelined Processors
IEEE Transactions on Computers
IEEE Transactions on Computers
Simultaneous multithreading: maximizing on-chip parallelism
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Increasing superscalar performance through multistreaming
PACT '95 Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Trace cache: a low latency approach to high bandwidth instruction fetching
Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture
MICRO 30 Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture
Reliable computer systems (3rd ed.): design and evaluation
Reliable computer systems (3rd ed.): design and evaluation
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
IBM's S/390 G5 Microprocessor Design
IEEE Micro
Concurrent Error Detection Using Watchdog Processors-A Survey
IEEE Transactions on Computers
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Incorporating fault tolerance in superscalar processors
HIPC '96 Proceedings of the Third International Conference on High-Performance Computing (HiPC '96)
Slipstream processors: improving both performance and fault tolerance
ACM SIGPLAN Notices
Efficient checker processor design
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
A study of slipstream processors
Proceedings of the 33rd annual ACM/IEEE international symposium on Microarchitecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Dynamically allocating processor resources between nearby and distant ILP
ISCA '01 Proceedings of the 28th annual international symposium on Computer architecture
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
A Fault Tolerant Approach to Microprocessor Design
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture
Enhancing data cache reliability by the addition of a small fully-associative replication cache
Proceedings of the 18th annual international conference on Supercomputing
Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor
Proceedings of the 31st annual international symposium on Computer architecture
Proceedings of the 31st annual international symposium on Computer architecture
Fingerprinting: bounding soft-error detection latency and bandwidth
ASPLOS XI Proceedings of the 11th international conference on Architectural support for programming languages and operating systems
A Case for Clumsy Packet Processors
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
SWIFT: Software Implemented Fault Tolerance
Proceedings of the international symposium on Code generation and optimization
Increasing Register File Immunity to Transient Errors
Proceedings of the conference on Design, Automation and Test in Europe - Volume 1
Improving java virtual machine reliability for memory-constrained embedded systems
Proceedings of the 42nd annual Design Automation Conference
Design and Evaluation of Hybrid Fault-Detection Systems
Proceedings of the 32nd annual international symposium on Computer Architecture
Opportunistic Transient-Fault Detection
Proceedings of the 32nd annual international symposium on Computer Architecture
Recursive TMR: Scaling Fault Tolerance in the Nanoscale Era
IEEE Design & Test
Compiler-guided register reliability improvement against soft errors
Proceedings of the 5th ACM international conference on Embedded software
Exploiting Coarse-Grain Verification Parallelism for Power-Efficient Fault Tolerance
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Replication Cache: A Small Fully Associative Cache to Improve Data Cache Reliability
IEEE Transactions on Computers
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Autonomic Microprocessor Execution via Self-Repairing Arrays
IEEE Transactions on Dependable and Secure Computing
Compiler-directed channel allocation for saving power in on-chip networks
Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
Opportunistic Transient-Fault Detection
IEEE Micro
Object duplication for improving reliability
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Self-Stabilizing Microprocessor: Analyzing and Overcoming Soft Errors
IEEE Transactions on Computers
Runtime integrity checking for inter-object connections
ICCAD '05 Proceedings of the 2005 IEEE/ACM International conference on Computer-aided design
Proceedings of the 43rd annual Design Automation Conference
Self-checking instructions: reducing instruction redundancy for concurrent error detection
Proceedings of the 15th international conference on Parallel architectures and compilation techniques
ReStore: Symptom-Based Soft Error Detection in Microprocessors
IEEE Transactions on Dependable and Secure Computing
Static typing for a faulty lambda calculus
Proceedings of the eleventh ACM SIGPLAN international conference on Functional programming
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
SlicK: slice-based locality exploitation for efficient redundant multithreading
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Cost-efficient soft error protection for embedded microprocessors
CASES '06 Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems
Reunion: Complexity-Effective Multicore Redundancy
Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
Architecting a reliable CMP switch architecture
ACM Transactions on Architecture and Code Optimization (TACO)
Online task-scheduling for fault-tolerant low-energy real-time systems
Proceedings of the 2006 IEEE/ACM international conference on Computer-aided design
Examining ACE analysis reliability estimates using fault-injection
Proceedings of the 34th annual international symposium on Computer architecture
Configurable isolation: building high availability systems with commodity multi-core processors
Proceedings of the 34th annual international symposium on Computer architecture
Mechanisms for bounding vulnerabilities of processor structures
Proceedings of the 34th annual international symposium on Computer architecture
Dynamic prediction of architectural vulnerability from microarchitectural state
Proceedings of the 34th annual international symposium on Computer architecture
Online diagnosis of hard faults in microprocessors
ACM Transactions on Architecture and Code Optimization (TACO)
Fault-tolerant typed assembly language
Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation
Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection
Proceedings of the International Symposium on Code Generation and Optimization
Modeling and improving data cache reliability: 1
Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Microprocessors in the era of terascale integration
Proceedings of the conference on Design, automation and test in Europe
Transient fault prediction based on anomalies in processor events
Proceedings of the conference on Design, automation and test in Europe
A low-SER efficient core processor architecture for future technologies
Proceedings of the conference on Design, automation and test in Europe
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Hierarchical Verification for Increasing Performance in Reliable Processors
Journal of Electronic Testing: Theory and Applications
Efficient fault tolerance in multi-media applications through selective instruction replication
Proceedings of the 2008 workshop on Radiation effects and fault tolerance in nanometer technologies
Anomaly-based fault detection in pervasive computing system
Proceedings of the 5th international conference on Pervasive services
A Systematic Approach to Automatically Generate Multiple Semantically Equivalent Program Versions
Ada-Europe '08 Proceedings of the 13th Ada-Europe international conference on Reliable Software Technologies
Reasoning about Control Flow in the Presence of Transient Faults
SAS '08 Proceedings of the 15th international symposium on Static Analysis
Techniques for Efficient Software Checking
Languages and Compilers for Parallel Computing
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Understanding software approaches for GPGPU reliability
Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units
Improving error tolerance for multithreaded register files
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Implementing high availability memory with a duplication cache
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Self-recovery in server programs
Proceedings of the 2009 international symposium on Memory management
ESoftCheck: Removal of Non-vital Checks for Fault Tolerance
Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization
Compiler-assisted soft error detection under performance and energy constraints in embedded systems
ACM Transactions on Embedded Computing Systems (TECS)
Instruction-Level Fault Tolerance Configurability
Journal of Signal Processing Systems
REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Architecture Design for Soft Errors
Architecture Design for Soft Errors
Synchronizing redundant cores in a dynamic DMR multicore architecture
IEEE Transactions on Circuits and Systems II: Express Briefs
Selective replication: A lightweight technique for soft errors
ACM Transactions on Computer Systems (TOCS)
Exploiting memory soft redundancy for joint improvement of error tolerance and access efficiency
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Shoestring: probabilistic soft error reliability on the cheap
Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems
Improving chip multiprocessor reliability through code replication
Computers and Electrical Engineering
Energy-efficient redundant execution for chip multiprocessors
Proceedings of the 20th symposium on Great lakes symposium on VLSI
Characterizing the soft error vulnerability of multicores running multithreaded applications
Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems
A compiler-based infrastructure for fault-tolerant co-design
Proceedings of the 13th International Workshop on Software & Compilers for Embedded Systems
A cost effective approach for online error detection using invariant relationships
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Modeling soft errors for data caches and alleviating their effects on data reliability
Microprocessors & Microsystems
DAFT: decoupled acyclic fault tolerance
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Dynamic processors demand dynamic operating systems
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
Scheduling for energy efficiency and fault tolerance in hard real-time systems
Proceedings of the Conference on Design, Automation and Test in Europe
Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors
Proceedings of the Conference on Design, Automation and Test in Europe
Performance-asymmetry-aware scheduling for Chip Multiprocessors with static core coupling
Journal of Systems Architecture: the EUROMICRO Journal
On the exploitation of narrow-width values for improving register file reliability
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Microprocessors & Microsystems
An efficient, dynamically adaptive method to tolerate transient faults in multi-core systems
EWDC '11 Proceedings of the 13th European Workshop on Dependable Computing
Releasing efficient beta cores to market early
Proceedings of the 38th annual international symposium on Computer architecture
A fault-tolerant, dynamically scheduled pipeline structure for chip multiprocessors
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
A self-checking hardware journal for a fault-tolerant processor architecture
International Journal of Reconfigurable Computing - Special issue on selected papers from the international workshop on reconfigurable communication-centric systems on chips (ReCoSoC' 2010)
Soft core based embedded systems in critical aerospace applications
Journal of Systems Architecture: the EUROMICRO Journal
Trade-offs in transient fault recovery schemes for redundant multithreaded processors
HiPC'06 Proceedings of the 13th international conference on High Performance Computing
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications
Chameleon: operating system support for dynamic processors
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Encore: low-cost, fine-grained transient fault recovery
Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
Resource-Driven optimizations for transient-fault detecting superscalar microarchitectures
ACSAC'05 Proceedings of the 10th Asia-Pacific conference on Advances in Computer Systems Architecture
Exploiting inactive rename slots for detecting soft errors
ARCS'10 Proceedings of the 23rd international conference on Architecture of Computing Systems
Quasi-static fault-tolerant scheduling schemes for energy-efficient hard real-time systems
Journal of Systems and Software
Efficient soft error protection for commodity embedded microprocessors using profile information
Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Runtime asynchronous fault tolerance via speculation
Proceedings of the Tenth International Symposium on Code Generation and Optimization
A first-order mechanistic model for architectural vulnerability factor
Proceedings of the 39th Annual International Symposium on Computer Architecture
The Journal of Supercomputing
RISE: improving the streaming processors reliability against soft errors in gpgpus
Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Operating system support for redundant multithreading
Proceedings of the tenth ACM international conference on Embedded software
Dynamic transient fault detection and recovery for embedded processor datapaths
Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Time-Constraint-Aware Optimization of Assertions in Embedded Software
Journal of Electronic Testing: Theory and Applications
Who watches the watchmen? - protecting operating system reliability mechanisms
HotDep'12 Proceedings of the Eighth USENIX conference on Hot Topics in System Dependability
Detection and correction of silent data corruption for large-scale high-performance computing
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Dynamic code duplication with vulnerability awareness for soft error detection on VLIW architectures
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Warped-DMR: Light-weight Error Detection for GPGPU
MICRO-45 Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Low cost control flow protection using abstract control signatures
Proceedings of the 14th ACM SIGPLAN/SIGBED conference on Languages, compilers and tools for embedded systems
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Proceedings of the ACM International Conference on Computing Frontiers
FaulTM: error detection and recovery using hardware transactional memory
Proceedings of the Conference on Design, Automation and Test in Europe
A work-stealing scheduling framework supporting fault tolerance
Proceedings of the Conference on Design, Automation and Test in Europe
Reli: hardware/software checkpoint and recovery scheme for embedded processors
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
A survey of checker architectures
ACM Computing Surveys (CSUR)
Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems
Journal of Electronic Testing: Theory and Applications
A dynamic approach to tolerate soft errors
Cluster Computing
A low-power instruction replay mechanism for design of resilient microprocessors
ACM Transactions on Embedded Computing Systems (TECS)
Epipe: A low-cost fault-tolerance technique considering WCET constraints
Journal of Systems Architecture: the EUROMICRO Journal
Hi-index | 0.01 |
Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations.We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor—derived from a Simultaneous Multithreaded (SMT) processor—provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order. This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms—slack fetch and branch outcome queue—that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.