Allowing applications to survive hardware failure is an expensive undertaking, which generally involves reengineering software to include complicated recovery logic as well as deploying special-purpose hardware; this represents a severe barrier to improving the dependability of large or legacy applications. We describe the construction of a general and transparent high availability service that allows existing, unmodified software to be protected from the failure of the physical machine on which it runs. Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections. Our approach encapsulates protected software in a virtual machine, asynchronously propagates changed state to a backup host at frequencies as high as forty times a second, and uses speculative execution to concurrently run the active VM slightly ahead of the replicated system state.
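The replication scheme in the last sentence can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the Remus implementation: the class names `Primary` and `Backup`, the page-level dictionary standing in for VM memory, and the in-order `drain` step that stands in for asynchronous network transfer are all hypothetical. The only figure taken from the abstract is the peak rate of forty checkpoints per second, which corresponds to a 25 ms epoch.

```python
from dataclasses import dataclass, field

# 40 checkpoints per second (the peak rate quoted in the abstract) = 25 ms epochs.
EPOCH_MS = 1000 // 40


@dataclass
class Backup:
    """Hypothetical backup host: holds the last checkpoint it fully received."""
    memory: dict = field(default_factory=dict)
    epoch: int = 0

    def apply(self, epoch, dirty_pages):
        # Only whole checkpoints are applied, so the backup state is always
        # a consistent image of the primary at some epoch boundary.
        self.memory.update(dirty_pages)
        self.epoch = epoch


@dataclass
class Primary:
    """Hypothetical protected VM: runs ahead of the replicated state."""
    memory: dict = field(default_factory=dict)
    dirty: set = field(default_factory=set)
    epoch: int = 0
    in_flight: list = field(default_factory=list)  # captured but not yet delivered

    def write(self, page, value):
        # Speculative execution: the VM keeps changing state without
        # waiting for the backup to catch up.
        self.memory[page] = value
        self.dirty.add(page)

    def checkpoint(self):
        # Capture only the pages changed during this epoch, then resume.
        self.epoch += 1
        snapshot = {page: self.memory[page] for page in self.dirty}
        self.dirty.clear()
        self.in_flight.append((self.epoch, snapshot))

    def drain(self, backup):
        # Stand-in for asynchronous propagation: deliver queued checkpoints in order.
        while self.in_flight:
            epoch, snapshot = self.in_flight.pop(0)
            backup.apply(epoch, snapshot)


def failover(backup):
    # On primary failure, execution resumes from the backup's last checkpoint.
    return dict(backup.memory)


def demo():
    primary, backup = Primary(), Backup()
    primary.write("p0", "a")
    primary.checkpoint()       # epoch 1 captured
    primary.drain(backup)      # epoch 1 reaches the backup
    primary.write("p0", "b")   # primary runs ahead of the replicated state
    primary.write("p1", "c")
    # The primary fails here, before epoch 2 is captured.
    return failover(backup), backup.epoch
```

In this toy run, `demo()` recovers the state as of epoch 1; the writes made after that checkpoint are lost with the primary. A real system additionally has to ensure that the outside world never observes such discarded speculative state, which the sketch does not model.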