Measurement of Failure Rate in Widely Distributed Software

Authors:
R. Chillarege; S. Biyani; J. Rosenthal
Venue:
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Year:
1995

References (7):
On the Reliability of the IBM MVS/XA Operating System. IEEE Transactions on Software Engineering.
Computer event monitoring and analysis. Computer event monitoring and analysis.
Top five challenges facing the practice of fault-tolerance. Papers of the workshop on Hardware and software architectures for fault tolerance: experiences and perspectives.
Recognition of error symptoms in large systems. ACM '86 Proceedings of 1986 ACM Fall joint computer conference.
Dependability: Basic Concepts and Terminology. Dependability: Basic Concepts and Terminology.
Hardware-Related Software Errors: Measurement and Analysis. IEEE Transactions on Software Engineering.
Optimizing preventive service of software products. IBM Journal of Research and Development.

Cited by (31):
Analysis of Preventive Maintenance in Transactions Based Software Systems. IEEE Transactions on Computers.
Diagnosing Rediscovered Software Problems Using Symptoms. IEEE Transactions on Software Engineering.
Analysis and implementation of software rejuvenation in cluster systems. Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
A Measurement-Based Framework for Software Reliability Improvement. Annals of Software Engineering.
Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data. Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures.
Software Reliability and Rejuvenation: Modeling and Analysis. Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures.
Efficient service of rediscovered software problems. FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96).
Evaluating the Impact of Communication Architecture on the Performability of Cluster-Based Services. HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture.
Failure Data Analysis of a LAN of Windows NT Based Computers. SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.
Primary-shadow consistency issues in the DRB scheme and the recovery time bound. ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering.
Error injection aimed at fault removal in fault tolerance mechanisms-criteria for error selection using field data on software faults. ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering.
Analyze-NOW-an environment for collection and analysis of failures in a network of workstations. ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering.
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems. ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering.
Empirical evaluation of defect projection models for widely-deployed production software systems. Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of software engineering.
Reflections on Industry Trends and Experimental Research in Dependability. IEEE Transactions on Dependable and Secure Computing.
Effective Fault Treatment for Improving the Dependability of COTS and Legacy-Based Applications. IEEE Transactions on Dependable and Secure Computing.
Predictors of customer perceived software quality. Proceedings of the 27th international conference on Software engineering.
A Comprehensive Model for Software Rejuvenation. IEEE Transactions on Dependable and Secure Computing.
Proactive management of software aging. IBM Journal of Research and Development.
Application Server Aging Prediction Model Based on Wavelet Network with Adaptive Particle Swarm Optimization Algorithm. ICIC '07 Proceedings of the 3rd International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications. With Aspects of Artificial Intelligence.
Pro-active failure handling mechanisms for scheduling in grid computing environments. Journal of Parallel and Distributed Computing.
References. Dependability metrics.
A software integration approach for designing and assessing dependable embedded systems. Journal of Systems and Software.
Dependable computing: concepts, limits, challenges. FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing.
Experimental evaluation. FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing.
OS-level hang detection in complex software systems. International Journal of Critical Computer-Based Systems.
Study on application server aging prediction based on wavelet network with hybrid genetic algorithm. ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications.
Operating system support to detect application hangs. VECoS'08 Proceedings of the Second international conference on Verification and Evaluation of Computer and Communication Systems.
A Recovery-Oriented Approach for Software Fault Diagnosis in Complex Critical Systems. International Journal of Adaptive, Resilient and Autonomic Systems.
Reliability Based Scheduling Model RSM for Computational Grids. International Journal of Distributed Systems and Technologies.
Post-release reliability growth in software products. ACM Transactions on Software Engineering and Methodology (TOSEM).

Abstract

In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is measuring the customer-perceived failure rate of commercial software. Even order-of-magnitude measures of failure rate are not truly available for widely distributed commercial software. Despite repeated reports on the criticality and significance of software, the industry still lacks real baselines. This paper reports the failure rate of a commercial software product of several million lines of code, distributed to hundreds of thousands of customers. To a first-order approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release, and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, the number of failures due to a fault, and the failure window, the length of time between the first and last failure due to that fault. Both metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant by severity. The fault weight and failure window are natural, intuitive measures of the failure process: the fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new forum for discussion and an opportunity to gain greater understanding of the processes involved.
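
As an illustration of the two metrics defined in the abstract, the following minimal sketch computes a fault weight, a failure window, and their ratio from a list of failure reports. It is not the authors' measurement procedure: the record layout (a fault identifier paired with a failure date), the function name fault_metrics, the choice of days as the time unit, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def fault_metrics(failures):
    """Compute per-fault weight, window, and window-to-weight ratio.

    failures: iterable of (fault_id, failure_date) pairs (hypothetical layout),
    one pair per reported failure, each already traced to its causing fault.
    """
    dates_by_fault = defaultdict(list)
    for fault_id, failure_date in failures:
        dates_by_fault[fault_id].append(failure_date)

    metrics = {}
    for fault_id, dates in dates_by_fault.items():
        # Fault weight: number of failures attributed to this fault.
        weight = len(dates)
        # Failure window: time between the first and last failure
        # attributed to this fault (days are an assumed unit).
        window_days = (max(dates) - min(dates)).days
        metrics[fault_id] = {
            "weight": weight,
            "window_days": window_days,
            "window_to_weight": window_days / weight,
        }
    return metrics


# Hypothetical failure reports, invented for this example.
sample_failures = [
    ("F1", date(1994, 1, 10)),
    ("F1", date(1994, 3, 11)),
    ("F1", date(1994, 7, 9)),
    ("F2", date(1994, 2, 1)),
]
sample_metrics = fault_metrics(sample_failures)
```

With these invented records, fault F1 has a weight of 3 and a window of 180 days, giving a window-to-weight ratio of 60 days per failure. Fault F2, reported only once, has a weight of 1 and a window of 0.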