In virtualized large-scale parallel systems, scientific workloads consist of numerous processes running across many virtual nodes. Their memory footprint is massive, which has consequences for services that enhance performance, reliability, or power. We argue that a service that dynamically tracks the sharing of memory content, both within individual nodes and across nodes, can simplify and enhance the implementation of such services. For example, leveraging content sharing could significantly reduce the size of a checkpoint of a group of nodes. As another example, it could speed up VM migration by allowing a VM's memory to be reconstructed from multiple source VMs. Finally, a service that improves reliability by introducing memory redundancy could leverage existing content sharing to minimize the memory cost of any particular level of redundancy. We argue that both intra- and inter-node memory content sharing are common in parallel applications, and we support this claim with a detailed study of both kinds of sharing at different scales, granularities, and times for a range of applications and application benchmarks. We then describe the high-level approach we are taking to design and implement a distributed, VMM-based system that can identify and track such sharing efficiently, scalably, and with low overhead.
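To make the checkpoint example concrete, the following is a minimal sketch of how intra- and inter-node content sharing could be measured, assuming fixed-size 4 KiB pages and content hashing with SHA-1. It is not the system described above: the function names, the page size, the choice of hash, and the representation of each node's memory as a byte string are all illustrative assumptions.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity; the study considers several granularities


def page_hashes(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Split a memory image into fixed-size pages and hash each page's content."""
    return [
        hashlib.sha1(memory[i:i + page_size]).digest()
        for i in range(0, len(memory), page_size)
    ]


def sharing_summary(nodes: dict[str, bytes], page_size: int = PAGE_SIZE) -> dict[str, int]:
    """Count pages before and after removing duplicate content.

    `nodes` maps a hypothetical node name to that node's memory image.
    Images are assumed to be a whole number of pages long.
    """
    per_node = {name: page_hashes(mem, page_size) for name, mem in nodes.items()}
    total = sum(len(hashes) for hashes in per_node.values())

    # Pages left if each node removes only its own duplicates (intra-node sharing).
    intra_unique = sum(len(set(hashes)) for hashes in per_node.values())

    # Pages left if duplicates are also removed across nodes (inter-node sharing).
    global_unique = len(set().union(*[set(hashes) for hashes in per_node.values()]))

    return {
        "total_pages": total,
        "unique_after_intra_node": intra_unique,
        "unique_after_inter_node": global_unique,
        "naive_checkpoint_bytes": total * page_size,
        "dedup_checkpoint_bytes": global_unique * page_size,
    }
```

Calling `sharing_summary({"node0": image0, "node1": image1})` on two memory images gives the page counts a group checkpoint would need with and without deduplication. A real tracker would have to update these hashes incrementally as pages change and distribute the lookup across nodes, which this sketch does not attempt.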