A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enable microrebooting, a fine-grain technique for surgically recovering faulty application components without disturbing the rest of the application. We evaluate microrebooting in an Internet auction system running on an application server. Microreboots recover most of the same failures as full reboots, but do so an order of magnitude faster and with an order of magnitude less lost work. This cheap form of recovery engenders a new approach to high availability: microreboots can be employed at the slightest hint of failure, prior to node failover in multi-node clusters, even when mistakes in failure detection are likely; failure and recovery can be masked from end users through transparent call-level retries; and systems can be rejuvenated by parts, without ever being shut down.
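The idea of recovering one component and retrying the call can be illustrated with a minimal sketch. This is not the system evaluated in the work. Every name here (`SessionStore`, `BidComponent`, `Container`, `microreboot`, `call`) is hypothetical, the `corrupted` flag merely stands in for an unknown in-memory fault, and the retry is assumed safe only because the example call is idempotent.

```python
class SessionStore:
    """Hypothetical state store kept outside the components, so that
    restarting a component (process recovery) never touches the data."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class BidComponent:
    """Hypothetical auction component that holds no durable state itself."""

    def __init__(self, store):
        self.store = store
        self.corrupted = False  # stands in for a latent in-memory fault

    def place_bid(self, item, amount):
        if self.corrupted:
            raise RuntimeError("component in bad state")
        # Record the bid only if it beats the current best; idempotent.
        if amount > self.store.get(item, 0):
            self.store.put(item, amount)
        return self.store.get(item)


class Container:
    """Hypothetical container that can restart one component at a time."""

    def __init__(self, store):
        self.store = store
        self.components = {"bid": BidComponent(store)}
        self.microreboots = 0

    def microreboot(self, name):
        # Replace only the faulty instance with a fresh one; the store and
        # all other components are left untouched.
        cls = type(self.components[name])
        self.components[name] = cls(self.store)
        self.microreboots += 1

    def call(self, name, method, *args, retries=1):
        # Call-level retry: on failure, microreboot the component and try
        # again, so the caller does not see the failure.
        for attempt in range(retries + 1):
            try:
                return getattr(self.components[name], method)(*args)
            except RuntimeError:
                if attempt == retries:
                    raise  # out of retries; a coarser recovery would follow
                self.microreboot(name)
```

In this sketch, a caller that invokes `container.call("bid", "place_bid", "lamp", 15)` against a faulty component gets a normal result after one microreboot, and earlier bids survive because they live in the store rather than in the component.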