Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we promote a proactive one in which processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques, exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realizing FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications. It is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
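The daemon's three duties named above (health monitoring, load determination, and initiation of guest OS migration) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `NodeStatus` record, the temperature-based health test, the 70 °C threshold, and the function names are all hypothetical and chosen only for illustration. The abstract does not say which command the daemon issues, so the sketch assumes Xen's classic `xm migrate --live <domain> <host>` invocation and only builds the argument list rather than executing it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node as seen by the FT daemon."""
    host: str
    temperature_c: float  # assumed health indicator (e.g. a sensor reading)
    load: float           # assumed load metric (e.g. run-queue length)


TEMP_LIMIT_C = 70.0  # assumed threshold, not taken from the paper


def is_unhealthy(node: NodeStatus) -> bool:
    """Health monitoring: flag a node whose reading crosses the threshold."""
    return node.temperature_c >= TEMP_LIMIT_C


def pick_target(nodes: List[NodeStatus], source: NodeStatus) -> Optional[NodeStatus]:
    """Load determination: choose the least-loaded healthy node other than the source."""
    candidates = [n for n in nodes if n.host != source.host and not is_unhealthy(n)]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.load)


def plan_migration(domain: str, source: NodeStatus,
                   nodes: List[NodeStatus]) -> Optional[List[str]]:
    """Migration initiation: return the live-migration command, or None if not needed."""
    if not is_unhealthy(source):
        return None  # node is healthy, leave the guest where it is
    target = pick_target(nodes, source)
    if target is None:
        return None  # no healthy destination; fall back to reactive FT
    # Assumed command form; a real daemon would pass this to subprocess.run().
    return ["xm", "migrate", "--live", domain, target.host]


if __name__ == "__main__":
    cluster = [
        NodeStatus("node1", 82.0, 0.9),  # deteriorating
        NodeStatus("node2", 45.0, 0.7),
        NodeStatus("node3", 50.0, 0.2),  # healthy and least loaded
    ]
    print(plan_migration("guest1", cluster[0], cluster))
```

Returning `None` when no healthy destination exists reflects the abstract's point that proactive FT complements checkpoint/restart rather than replacing it: a failure that cannot be avoided by migration is still handled reactively.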