Based on extensive field failure data for Tandem's GUARDIAN operating system, this paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.