Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software

Authors:
Subhachandra Chandra;Peter M. Chen
Venue:
DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
Year:
2000

Citing 0
Cited 22

Rx: treating bugs as allergies---a safe method to survive software failures

Proceedings of the twentieth ACM symposium on Operating systems principles
Have things changed now?: an empirical study of bug characteristics in modern open source software

Proceedings of the 1st workshop on Architectural and system support for improving software dependability
Recovering device drivers

ACM Transactions on Computer Systems (TOCS)
On modeling and tolerating incorrect software

Journal of High Speed Networks - Self-Stabilizing Systems, Part 2
Correlating multi-session attacks via replay

HOTDEP'06 Proceedings of the 2nd conference on Hot Topics in System Dependability - Volume 2
Treating bugs as allergies: a safe method for surviving software failures

HOTOS'05 Proceedings of the 10th conference on Hot Topics in Operating Systems - Volume 10
Exploring failure transparency and the limits of generic recovery

OSDI'00 Proceedings of the 4th conference on Symposium on Operating System Design & Implementation - Volume 4
Recovering device drivers

OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6
Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems (TOCS)
Fault Tolerance via Diversity for Off-the-Shelf Products: A Study with SQL Database Servers

IEEE Transactions on Dependable and Secure Computing
Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
Uncertainty explicit assessment of off-the-shelf software: A Bayesian approach

Information and Software Technology
Using Inherent Service Redundancy and Diversity to Ensure Web Services Dependability

Methods, Models and Tools for Fault Tolerance
An empirical study of reported bugs in server software with implications for automated bug diagnosis

Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1
Correlating multi-session attacks via replay

HotDep'06 Proceedings of the Second conference on Hot topics in system dependability
How do programs become more concurrent: a story of program transformations

Proceedings of the 4th International Workshop on Multicore Software Engineering
Locating failure-inducing environment changes

Proceedings of the 10th ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools
Dependable composite web services with components upgraded online

Architecting Dependable Systems III
F(I)MEA-technique of web services analysis and dependability ensuring

Rigorous Development of Complex Fault-Tolerant Systems
How does testing affect the availability of aging software systems?

Performance Evaluation
A characteristic study on failures of production distributed data-parallel programs

Proceedings of the 2013 International Conference on Software Engineering
Discovering, reporting, and fixing performance bugs

Proceedings of the 10th Working Conference on Mining Software Repositories

Abstract

This paper tests the hypothesis that generic recovery techniques, such as process pairs, can survive most application faults without using application-specific information. We examine in detail the faults that occur in three large open-source applications: the Apache web server, the GNOME desktop environment, and the MySQL database. Using information contained in the bug reports and source code, we classify faults based on how they depend on the operating environment. We find that 72-87% of the faults are independent of the operating environment and are hence deterministic (non-transient). Recovering from the failures caused by these faults requires the use of application-specific knowledge. Half of the remaining faults depend on a condition in the operating environment that is likely to persist on retry, and the failures caused by these faults are likely to require application-specific recovery. Unfortunately, only 5-14% of the faults were triggered by transient conditions, such as timing and synchronization, that naturally fix themselves during recovery. Our results indicate that classical application-generic recovery techniques, such as process pairs, will not be sufficient to enable applications to survive most failures caused by application faults.
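The three-way split described in the abstract can be sketched as a small decision rule. This is only an illustration of the reasoning, not the authors' tooling: the class labels, function names, and the two boolean inputs are assumptions introduced here, and in the study the classification was made by reading bug reports and source code rather than by a program.

```python
from enum import Enum


class FaultClass(Enum):
    # Labels are hypothetical names for the three groups the abstract describes.
    ENV_INDEPENDENT = "independent of the operating environment"
    ENV_DEPENDENT_PERSISTENT = "depends on a condition likely to persist on retry"
    ENV_DEPENDENT_TRANSIENT = "depends on a transient condition"


def classify(depends_on_environment: bool, condition_persists_on_retry: bool) -> FaultClass:
    """Assign a fault to one of the three groups (illustrative rule only)."""
    if not depends_on_environment:
        # Deterministic: the same input triggers the same failure again.
        return FaultClass.ENV_INDEPENDENT
    if condition_persists_on_retry:
        # For example, a resource condition that is still present after restart.
        return FaultClass.ENV_DEPENDENT_PERSISTENT
    # For example, a timing or synchronization condition that changes on retry.
    return FaultClass.ENV_DEPENDENT_TRANSIENT


def generic_retry_likely_to_help(fault: FaultClass) -> bool:
    """Only transient conditions are expected to clear during generic recovery."""
    return fault is FaultClass.ENV_DEPENDENT_TRANSIENT
```

Under this reading, a generic technique such as process pairs is expected to help only with the last group, which the abstract puts at 5-14% of the faults studied.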