In this paper, we show how to reduce the downtime of J2EE applications by recovering rapidly and automatically from transient and intermittent software failures, without requiring modifications to the applications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery, and external management of recovery actions. The individual techniques are autonomous and work across a wide range of componentized Internet applications, making them well suited to the rapidly changing software of Internet services. We have integrated the proposed framework with JBoss, an open-source J2EE application server. The resulting prototype is an execution platform that can automatically recover J2EE applications within seconds of a fault manifesting. It can give a subset of a system's active end users the illusion of continuous uptime despite failures occurring behind the scenes, even when the system has no functional redundancy.
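The abstract names three cooperating roles but does not give their algorithms. As a rough illustration only, the sketch below shows one way they could fit together: detection is reduced to a per-component anomaly score, recovery to a restart hook, and the external manager to a small decision step. The class `RecoveryManager`, the `threshold` value, the scores, and the component names are all hypothetical and are not taken from the paper or from JBoss.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RecoveryManager:
    """Hypothetical external manager: picks one suspect component and restarts it."""

    microreboot: Callable[[str], None]  # assumed hook that restarts a single component
    threshold: float = 0.8              # assumed anomaly-score cutoff, not from the paper
    log: List[str] = field(default_factory=list)

    def step(self, anomaly_scores: Dict[str, float]) -> Optional[str]:
        """Restart the most anomalous component if its score reaches the threshold."""
        if not anomaly_scores:
            return None
        suspect = max(anomaly_scores, key=anomaly_scores.get)
        if anomaly_scores[suspect] < self.threshold:
            return None  # nothing looks faulty enough to act on
        self.microreboot(suspect)
        self.log.append(suspect)
        return suspect


# Usage: the restart hook here only records which component was "rebooted".
restarted: List[str] = []
manager = RecoveryManager(microreboot=restarted.append)
first = manager.step({"CartBean": 0.95, "CatalogBean": 0.10})
second = manager.step({"CartBean": 0.05, "CatalogBean": 0.10})
```

Keeping the decision step outside the components mirrors the abstract's point that recovery is managed externally, so the application itself needs no changes; a real manager would also need a policy for repeated failures, which this sketch omits.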