High availability systems typically rely on redundant components and functionality to achieve fault detection, isolation, and failover. In the future, rising error rates will make high availability important even in the commodity and volume market. Systems will be built out of chip multiprocessors (CMPs) with multiple identical components that can be configured to provide redundancy for high availability. However, the 100% overhead of making every component redundant will be unacceptable for the commodity market, especially since not all applications require high availability. In particular, duplicating the entire memory, as current high availability systems such as NonStop and Stratus do, is especially problematic given that system costs are increasingly dominated by the cost of memory. In this paper, we propose a novel technique called a duplication cache to reduce the overhead of memory duplication in CMP-based high availability systems. A duplication cache is a reserved area of main memory that holds copies of pages belonging to the current write working set (the set of actively modified pages) of running processes. All other pages are marked read-only and kept as a single, shared copy. The size of the duplication cache can be configured dynamically at runtime, allowing system designers to trade the cost of memory duplication against a minor performance overhead. We extensively analyze the effectiveness of the duplication cache and show that, across a range of benchmarks, memory duplication can be reduced by 60–90% with a performance degradation of 1–12%. On average, a duplication cache reduces memory duplication by 60% for a performance overhead of 4%, and by 90% for an overhead of 5%.
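As a rough illustration of the write-working-set policy the abstract describes, the sketch below models a duplication cache as a fixed-size LRU pool of duplicated pages. The class, its method names, and the fault/eviction counters are hypothetical, invented for this sketch; the paper's actual mechanism operates on hardware pages via write protection, not Python objects. Writing a page that is not currently duplicated triggers a minor fault and allocates a duplicate; when the pool is full, the least-recently-written page is evicted and reverts to a single, read-only copy.

```python
from collections import OrderedDict

class DuplicationCache:
    """Toy model (hypothetical API, not the paper's implementation) of a
    duplication cache: a fixed-size pool of duplicated pages covering the
    current write working set. All other pages are single-copy and
    read-only; writing one faults and duplicates it on demand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dup = OrderedDict()   # page id -> duplicate copy, in LRU order
        self.minor_faults = 0      # duplications performed on first write
        self.evictions = 0         # duplicates dropped back to single copy

    def write(self, page, data):
        if page in self.dup:
            self.dup.move_to_end(page)        # fast path: already duplicated
        else:
            self.minor_faults += 1            # fault: duplicate on first write
            if len(self.dup) >= self.capacity:
                self.dup.popitem(last=False)  # evict LRU duplicate; that page
                self.evictions += 1           # reverts to a read-only copy
        self.dup[page] = data                 # update the duplicated copy

# A small write stream with locality: a 4-entry pool absorbs repeat writes.
cache = DuplicationCache(capacity=4)
for page in [1, 2, 3, 1, 2, 1, 4, 5, 1, 2]:
    cache.write(page, b"payload")
print(cache.minor_faults, cache.evictions)  # prints: 5 1
```

The tradeoff the abstract quantifies falls out of the capacity parameter: a smaller pool duplicates less memory but re-faults (and re-duplicates) pages more often, which is the source of the reported 1–12% performance overhead.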