It is time to broaden our performance-dominated research agenda. A four-order-of-magnitude increase in performance since the first ASPLOS in 1982 means that few outside the CS&E research community believe that speed is the only problem of computer hardware and software. Current systems crash and freeze so frequently that people become violent. Fast but flaky should not be our 21st century legacy. Recovery Oriented Computing (ROC) takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. By concentrating on Mean Time to Repair (MTTR) rather than Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. Since a large portion of system administration is dealing with failures, ROC may also reduce total cost of ownership. A one-to-two-order-of-magnitude reduction in cost means that the purchase price of hardware and software is now a small part of the total cost of ownership. In addition to giving the motivation and definition of ROC, we introduce failure data for Internet sites showing that the leading cause of outages is operator error. We also demonstrate five ROC techniques in five case studies, which we hope will influence designers of architectures and operating systems. If we embrace availability and maintainability, systems of the future may compete on recovery performance rather than just SPEC performance, and on total cost of ownership rather than just system price. Such a change may restore our pride in the architectures and operating systems we craft.
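The link between MTTR and availability can be made concrete with the standard steady-state definition, availability = MTTF / (MTTF + MTTR). The sketch below is illustrative only: the function names and the hour values are hypothetical and are not taken from the paper's failure data or case studies. Under that definition, it shows that dividing MTTR by ten yields the same availability as multiplying MTTF by ten.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(mttf_hours: float, mttr_hours: float) -> float:
    """Expected hours of downtime in a 365-day year at the given availability."""
    hours_per_year = 365 * 24
    return (1.0 - availability(mttf_hours, mttr_hours)) * hours_per_year


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the trade-off.
    scenarios = {
        "baseline (MTTF=1000h, MTTR=10h)": (1000.0, 10.0),
        "10x better MTTF (MTTF=10000h, MTTR=10h)": (10000.0, 10.0),
        "10x better MTTR (MTTF=1000h, MTTR=1h)": (1000.0, 1.0),
    }
    for label, (mttf, mttr) in scenarios.items():
        a = availability(mttf, mttr)
        d = downtime_hours_per_year(mttf, mttr)
        print(f"{label}: availability={a:.6f}, downtime={d:.2f} h/year")
```

Because availability depends only on the ratio of MTTR to MTTF, the two improved scenarios are equivalent in this model. The abstract's argument is that recovery time is the side of that ratio to concentrate on.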