Abstract: In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is measuring the customer-perceived failure rate of commercial software. Even order-of-magnitude measures of failure rate are not truly available for widely distributed commercial software. Despite repeated reports on the criticality and significance of software, the industry still lacks real baselines. This paper reports the failure rate of a commercial software product of several million lines of code distributed to hundreds of thousands of customers. To a first order of approximation, the MTBF reaches around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release, and time are also provided. The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, the number of failures due to a fault, and the failure window, the length of time between the first and last failure attributed to a fault. Both metrics are found to be higher for higher-severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant by severity. The fault weight and failure window are natural measures that offer intuition about the failure process: the fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new forum for discussion and an opportunity to gain greater understanding of the processes involved.
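The two metrics can be illustrated with a minimal sketch. It assumes failure reports are available as (fault identifier, date) pairs and that the failure window is measured in days; the record format, the function name `fault_metrics`, and the sample data are hypothetical and are not taken from the paper, whose exact definitions and units may differ.

```python
from collections import defaultdict
from datetime import date


def fault_metrics(failures):
    """Compute fault weight and failure window per fault.

    failures: iterable of (fault_id, failure_date) pairs
    (a hypothetical record format, assumed for illustration).
    """
    dates_by_fault = defaultdict(list)
    for fault_id, when in failures:
        dates_by_fault[fault_id].append(when)

    metrics = {}
    for fault_id, dates in dates_by_fault.items():
        weight = len(dates)                            # failures due to this fault
        window_days = (max(dates) - min(dates)).days   # first to last failure
        metrics[fault_id] = {
            "weight": weight,
            "window_days": window_days,
            "window_to_weight": window_days / weight,
        }
    return metrics


# Made-up sample data: fault F1 causes three failures, F2 causes one.
sample = [
    ("F1", date(1993, 1, 10)),
    ("F1", date(1993, 3, 11)),
    ("F1", date(1993, 7, 9)),
    ("F2", date(1993, 2, 1)),
]
result = fault_metrics(sample)
print(result["F1"])
```

In this sketch a fault that fails only once has a window of zero, so its window-to-weight ratio is also zero; how the paper treats such single-failure faults is not stated in the abstract.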