Hypervisor-based fault tolerance
ACM Transactions on Computer Systems (TOCS) - Special issue on operating system principles
DIVA: a reliable substrate for deep submicron microarchitecture design
Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture
Transient fault detection via simultaneous multithreading
Proceedings of the 27th annual international symposium on Computer architecture
Reconfigurable caches and their application to media processing
Proceedings of the 27th annual international symposium on Computer architecture
Piranha: a scalable architecture based on single-chip multiprocessing
Proceedings of the 27th annual international symposium on Computer architecture
Slipstream processors: improving both performance and fault tolerance
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Transient-fault recovery using simultaneous multithreading
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Detailed design and evaluation of redundant multithreading alternatives
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Dual use of superscalar datapath for transient-fault detection and recovery
Proceedings of the 34th annual ACM/IEEE international symposium on Microarchitecture
The Future of Systems Research
Computer
IBM's S/390 G5 Microprocessor Design
IEEE Micro
Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic
DSN '02 Proceedings of the 2002 International Conference on Dependable Systems and Networks
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
SMARTS: accelerating microarchitecture simulation via rigorous statistical sampling
Proceedings of the 30th annual international symposium on Computer architecture
Transient-fault recovery for chip multiprocessors
Proceedings of the 30th annual international symposium on Computer architecture
Exploiting Microarchitectural Redundancy For Defect Tolerance
ICCD '03 Proceedings of the 21st International Conference on Computer Design
The Case for Lifetime Reliability-Aware Microprocessors
Proceedings of the 31st annual international symposium on Computer architecture
Tolerating Hard Faults in Microprocessor Array Structures
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
The Impact of Technology Scaling on Lifetime Reliability
DSN '04 Proceedings of the 2004 International Conference on Dependable Systems and Networks
Reliability, availability, and serviceability (RAS) of the IBM eServer z990
IBM Journal of Research and Development
Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
Exploiting Structural Duplication for Lifetime Reliability Enhancement
Proceedings of the 32nd annual international symposium on Computer Architecture
NonStop® Advanced Architecture
DSN '05 Proceedings of the 2005 International Conference on Dependable Systems and Networks
POWER4 system microarchitecture
IBM Journal of Research and Development
StageNetSlice: a reconfigurable microarchitecture building block for resilient CMP systems
CASES '08 Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Mixed-mode multicore reliability
Proceedings of the 14th international conference on Architectural support for programming languages and operating systems
Implementing high availability memory with a duplication cache
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Efficient unicast and multicast support for CMPs
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
The StageNet fabric for constructing resilient multicore systems
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture
Dynamic heterogeneity and the need for multicore virtualization
ACM SIGOPS Operating Systems Review
Comparison of enterprise service buses based on their support of high availability
Proceedings of the 2010 ACM Symposium on Applied Computing
Energy-efficient redundant execution for chip multiprocessors
Proceedings of the 20th symposium on Great lakes symposium on VLSI
Adaptive online testing for efficient hard fault detection
ICCD'09 Proceedings of the 2009 IEEE international conference on Computer design
Necromancer: enhancing system throughput by animating dead cores
Proceedings of the 37th annual international symposium on Computer architecture
A conceptual model of self-monitoring multi-core systems
Proceedings of the Sixth Annual Workshop on Cyber Security and Information Intelligence Research
Scalable thread scheduling and global power management for heterogeneous many-core architectures
Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Dynamic processors demand dynamic operating systems
HotPar'10 Proceedings of the 2nd USENIX conference on Hot topics in parallelism
Design techniques for cross-layer resilience
Proceedings of the Conference on Design, Automation and Test in Europe
Multiplexed redundant execution: a technique for efficient fault tolerance in chip multiprocessors
Proceedings of the Conference on Design, Automation and Test in Europe
Erasing Core Boundaries for Robust and Configurable Performance
MICRO '43 Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture
A fault-tolerant, dynamically scheduled pipeline structure for chip multiprocessors
SAFECOMP'11 Proceedings of the 30th international conference on Computer safety, reliability, and security
A case for secure and scalable hypervisor using safe language
Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores
Chameleon: operating system support for dynamic processors
ASPLOS XVII Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems
Reliable computing with ultra-reduced instruction set co-processors
Proceedings of the 49th Annual Design Automation Conference
A novel NoC-based design for fault-tolerance of last-level caches in CMPs
Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Heuristic search for adaptive, defect-tolerant multiprocessor arrays
ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems
Journal of Electronic Testing: Theory and Applications
Self-Adaptive Fault Tolerance in Multi-/Many-Core Systems
Journal of Electronic Testing: Theory and Applications
Low cost permanent fault detection using ultra-reduced instruction set co-processors
Proceedings of the Conference on Design, Automation and Test in Europe
A new paradigm for trading off yield, area and performance to enhance performance per wafer
Proceedings of the Conference on Design, Automation and Test in Europe
Deconfigurable microprocessor architectures for silicon debug acceleration
Proceedings of the 40th Annual International Symposium on Computer Architecture
An adaptive approach for online fault management in many-core architectures
DATE '12 Proceedings of the Conference on Design, Automation and Test in Europe
Scheduling highly available applications on cloud environments
Future Generation Computer Systems
NoC-based fault-tolerant cache design in chip multiprocessors
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and continued uptime in the event of a failure. Chip multiprocessors with an abundance of identical resources like cores, cache and interconnection networks would appear to be ideal building blocks for implementing high availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement. Increasing silicon and transient fault rates with future technology scaling exacerbate the problem. This paper proposes a novel, cost-effective, architecture for high availability systems built from future multi-core processors. We propose a new chip multiprocessor architecture that provides configurable isolation for fault containment and component retirement, based upon cost-effective modifications to commodity designs. The design is evaluated for a state-of-the-art industrial fault model and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.