Whither Generic Recovery from Application Faults? A Fault Study using Open-Source Software

  • Authors:
  • Subhachandra Chandra; Peter M. Chen

  • Venue:
  • DSN '00 Proceedings of the 2000 International Conference on Dependable Systems and Networks (formerly FTCS-30 and DCCA-8)
  • Year:
  • 2000

Abstract

This paper tests the hypothesis that generic recovery techniques, such as process pairs, can survive most application faults without using application-specific information. We examine in detail the faults that occur in three large open-source applications: the Apache web server, the GNOME desktop environment, and the MySQL database. Using information contained in the bug reports and source code, we classify faults based on how they depend on the operating environment. We find that 72-87% of the faults are independent of the operating environment and are hence deterministic (non-transient). Recovering from the failures caused by these faults requires the use of application-specific knowledge. Half of the remaining faults depend on a condition in the operating environment that is likely to persist on retry, and the failures caused by these faults are also likely to require application-specific recovery. Unfortunately, only 5-14% of the faults were triggered by transient conditions, such as timing and synchronization, that naturally fix themselves during recovery. Our results indicate that classical application-generic recovery techniques, such as process pairs, will not be sufficient to enable applications to survive most failures caused by application faults.
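
The three-way classification described in the abstract can be sketched as a small decision procedure. This is only an illustration and not the authors' tooling: the class labels, the `FaultReport` fields, and the function names are assumptions introduced here, and in the study the two yes/no answers came from reading bug reports and source code by hand.

```python
from dataclasses import dataclass
from enum import Enum


class FaultClass(Enum):
    # Labels are illustrative names for the three categories in the abstract.
    ENV_INDEPENDENT = "environment-independent (deterministic)"
    ENV_DEPENDENT_NONTRANSIENT = "environment-dependent, persists on retry"
    ENV_DEPENDENT_TRANSIENT = "environment-dependent, transient"


@dataclass
class FaultReport:
    # Hypothetical fields: answers a human reader would extract from a bug report.
    depends_on_environment: bool
    condition_persists_on_retry: bool


def classify(report: FaultReport) -> FaultClass:
    """Map the two judgments about a fault onto one of the three categories."""
    if not report.depends_on_environment:
        return FaultClass.ENV_INDEPENDENT
    if report.condition_persists_on_retry:
        return FaultClass.ENV_DEPENDENT_NONTRANSIENT
    return FaultClass.ENV_DEPENDENT_TRANSIENT


def generic_recovery_likely_to_help(fault_class: FaultClass) -> bool:
    """Per the abstract's argument, only transient conditions tend to
    clear on their own when a generic technique simply retries."""
    return fault_class is FaultClass.ENV_DEPENDENT_TRANSIENT
```

For instance, a fault that is triggered by a timing condition would be recorded as `FaultReport(depends_on_environment=True, condition_persists_on_retry=False)` and fall into the transient category, the only one for which this sketch predicts that a generic retry is likely to succeed.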