DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
ED4I: Error Detection by Diverse Data and Duplicated Instructions
IEEE Transactions on Computers - Special issue on fault-tolerant embedded systems
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Soft-Error Detection Using Control Flow Assertions
DFT '03 Proceedings of the 18th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems
The Soft Error Problem: An Architectural Perspective
HPCA '05 Proceedings of the 11th International Symposium on High-Performance Computer Architecture
Rescue: A Microarchitecture for Testability and Defect Tolerance
Proceedings of the 32nd annual international symposium on Computer Architecture
ReStore: Symptom Based Soft Error Detection in Microprocessors
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors
Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture
Software-controlled fault tolerance
ACM Transactions on Architecture and Code Optimization (TACO)
Ultra low-cost defect protection for microprocessor pipelines
Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
Configurable isolation: building high availability systems with commodity multi-core processors
Proceedings of the 34th annual international symposium on Computer architecture
Circuit Failure Prediction and Its Application to Transistor Aging
VTS '07 Proceedings of the 25th IEEE VLSI Test Symmposium
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Automated Derivation of Application-aware Error Detectors using Static Analysis
IOLTS '07 Proceedings of the 13th IEEE International On-Line Testing Symposium
Algorithm-Based Fault Tolerance for Matrix Operations
IEEE Transactions on Computers
Watchdog Processors and Structural Integrity Checking
IEEE Transactions on Computers
Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital System Design
IEEE Transactions on Computers
Penelope: The NBTI-Aware Processor
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Self-calibrating Online Wearout Detection
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Argus: Low-Cost, Comprehensive Error Detection in Simple Cores
Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Trading off Cache Capacity for Reliability to Enable Low Voltage Operation
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture
IFRA: instruction footprint recording and analysis for post-silicon bug localization in processors
Proceedings of the 45th annual Design Automation Conference
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Sequential element design with built-in soft error resilience
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Architectural core salvaging in a multi-core processor for hard-error tolerance
Proceedings of the 36th annual international symposium on Computer architecture
Circuit techniques for dynamic variation tolerance
Proceedings of the 46th Annual Design Automation Conference
Vicis: a reliable network for unreliable silicon
Proceedings of the 46th Annual Design Automation Conference
The use of triple-modular redundancy to improve computer reliability
IBM Journal of Research and Development
Overcoming Early-Life Failure and Aging for Robust Systems
IEEE Design & Test
Vision for cross-layer optimization to address the dual challenges of energy and reliability
Proceedings of the Conference on Design, Automation and Test in Europe
Power Efficient Approaches to Redundant Multithreading
IEEE Transactions on Parallel and Distributed Systems
Cross-layer resilience challenges: metrics and optimization
Proceedings of the Conference on Design, Automation and Test in Europe
Cross-layer virtual observers for embedded multiprocessor system-on-chip (MPSoC)
Proceedings of the 11th International Workshop on Adaptive and Reflective Middleware
Reliability challenges for electric vehicles: from devices to architecture and systems software
Proceedings of the 50th Annual Design Automation Conference
Improving the fault resilience of an H.264 decoder using static analysis methods
ACM Transactions on Embedded Computing Systems (TECS) - Special Section on ESTIMedia'10
Hi-index | 0.00 |
Current electronic systems implement reliability using only a few layers of the system stack, which simplifies the design of other layers but is becoming increasingly expensive over time. In contrast, cross-layer resilient systems, which distribute the responsibility for tolerating errors, device variation, and aging across the system stack, have the potential to provide the resilience required to implement reliable, high-performance, low-power systems in future fabrication processes at significantly lower cost. These systems can implement less-frequent resilience tasks in software to save power and chip area, can tune their reliability guarantees to the needs of applications, and can use the information available at each level in the system stack to optimize performance and power consumption. In this paper, we outline an approach to cross-layer system design that describes resilience as a set of tasks that systems must perform in order to detect and tolerate errors and variation. We then present strawman examples of how this task-based design process could be used to implement general-purpose computing and SoC systems, drawing on previous work and identifying key areas for future research.