High availability systems typically rely on redundant components and functionality to achieve fault detection, isolation, and failover. In the future, rising error rates will make high availability important even in the commodity and volume market. Systems will be built out of chip multiprocessors (CMPs) with multiple identical components that can be configured to provide redundancy for high availability. However, the 100% overhead of making every component redundant will be unacceptable for the commodity market, especially since not all applications require high availability. In particular, duplicating the entire memory, as current high availability systems such as NonStop and Stratus do, is especially problematic given that system costs are increasingly dominated by the cost of memory. In this paper, we propose a novel technique called a duplication cache to reduce the overhead of memory duplication in CMP-based high availability systems. A duplication cache is a reserved area of main memory that holds copies of pages belonging to the current write working set (the set of actively modified pages) of running processes. All other pages are marked read-only and kept as a single, shared copy. The size of the duplication cache can be configured dynamically at runtime, allowing system designers to trade the cost of memory duplication against a minor performance overhead. We extensively analyze the effectiveness of the duplication cache and show that, across a range of benchmarks, memory duplication can be reduced by 60–90% with a performance degradation of 1–12%. On average, a duplication cache reduces memory duplication by 60% for a performance overhead of 4%, and by 90% for an overhead of 5%.
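As a rough illustration of the write-working-set policy the abstract describes, the sketch below models a duplication cache as a fixed-size LRU pool of duplicated pages. The class, its method names, and the fault/eviction counters are hypothetical, invented for this sketch; the paper's actual mechanism operates on hardware pages via write protection, not Python objects. Writing a page that is not currently duplicated triggers a minor fault and allocates a duplicate; when the pool is full, the least-recently-written page is evicted and reverts to a single, read-only copy.

```python
from collections import OrderedDict

class DuplicationCache:
    """Toy model (hypothetical API, not the paper's implementation) of a
    duplication cache: a fixed-size pool of duplicated pages covering the
    current write working set. All other pages are single-copy and
    read-only; writing one faults and duplicates it on demand."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.dup = OrderedDict()   # page id -> duplicate copy, in LRU order
        self.minor_faults = 0      # duplications performed on first write
        self.evictions = 0         # duplicates dropped back to single copy

    def write(self, page, data):
        if page in self.dup:
            self.dup.move_to_end(page)        # fast path: already duplicated
        else:
            self.minor_faults += 1            # fault: duplicate on first write
            if len(self.dup) >= self.capacity:
                self.dup.popitem(last=False)  # evict LRU duplicate; that page
                self.evictions += 1           # reverts to a read-only copy
        self.dup[page] = data                 # update the duplicated copy

# A small write stream with locality: a 4-entry pool absorbs repeat writes.
cache = DuplicationCache(capacity=4)
for page in [1, 2, 3, 1, 2, 1, 4, 5, 1, 2]:
    cache.write(page, b"payload")
print(cache.minor_faults, cache.evictions)  # prints: 5 1
```

The tradeoff the abstract quantifies falls out of the capacity parameter: a smaller pool duplicates less memory but re-faults (and re-duplicates) pages more often, which is the source of the reported 1–12% performance overhead.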