Today's commercial cloud service providers require an annual uptime of at least 99.95\%. Meanwhile, as memory density and capacity grow in cloud applications, memory errors are becoming the norm rather than the exception, so uncorrected DRAM errors can be a significant source of system downtime. To address this increasingly important concern, both hardware and software memory mirroring technologies have been studied as ways to provide highly available memory. However, hardware solutions such as mirrored memory, which doubles the memory chips, need dedicated and costly peripheral hardware, while existing software approaches, such as virtual machine checkpointing, reduce the expense but incur high overhead in practical use. In this paper, we present a novel system called \emph{k}Memvisor that provides system-wide, highly available memory mirroring. It is a software approach that achieves flexible multi-granularity memory mirroring via virtualization and binary translation. Specifically, kMemvisor first creates a backup space of the same size as the specified memory for applications or virtual machines; memory areas can be flexibly set to be mirrored or not, at granularities from application level to system-wide. Then, all memory write instructions in the native memory space are captured and instrumented with mirror write instructions that synchronize the data to the backup space. When a memory failure happens, kMemvisor can recover using the data in the backup space. This instruction-level memory synchronization also reduces backup overhead and lowers the probability of data loss compared with traditional software approaches. The results show that kMemvisor incurs 55% overhead in the worst case of system-wide high availability and 30% on average for real-world applications, outperforming state-of-the-art software approaches even in the worst case.
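The write-mirroring idea described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the kMemvisor implementation: the real system instruments machine-level write instructions through binary translation, whereas this model works on a Python byte buffer. The class name `MirroredMemory`, its methods, and the fixed-offset layout (backup placed directly after the native region) are all hypothetical choices made for illustration.

```python
from typing import Optional


class MirroredMemory:
    """Toy model of write mirroring: each store to a mirrored native
    address is followed by the same store to a backup region."""

    def __init__(self, size: int, mirrored: Optional[range] = None):
        self.size = size
        # Assumed layout: native space in [0, size), backup in [size, 2*size).
        self.mem = bytearray(2 * size)
        # Addresses selected for mirroring (default: the whole native space).
        self.mirrored = mirrored if mirrored is not None else range(size)

    def write(self, addr: int, value: int) -> None:
        if not 0 <= addr < self.size:
            raise IndexError("address outside native space")
        self.mem[addr] = value                  # original write
        if addr in self.mirrored:
            self.mem[addr + self.size] = value  # added mirror write

    def read(self, addr: int) -> int:
        if not 0 <= addr < self.size:
            raise IndexError("address outside native space")
        return self.mem[addr]

    def recover(self, addr: int) -> bool:
        """Restore a native byte from its backup copy, if it is mirrored."""
        if addr not in self.mirrored:
            return False
        self.mem[addr] = self.mem[addr + self.size]
        return True


# Mirror only the first half of a 16-byte native space.
mem = MirroredMemory(size=16, mirrored=range(0, 8))
mem.write(3, 0xAB)    # mirrored address
mem.write(12, 0xCD)   # unmirrored address

# Simulate memory errors by flipping bits in the native copies only.
mem.mem[3] ^= 0xFF
mem.mem[12] ^= 0xFF

recovered_3 = mem.recover(3)    # backup copy exists, value is restored
recovered_12 = mem.recover(12)  # no backup copy, corruption remains
```

In this model the mirrored byte is restored after the simulated fault, while the unmirrored byte stays corrupted. That reflects the trade-off behind selective mirroring: an unmirrored region costs no extra write, but it cannot be recovered.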