IBM Journal of Research and Development
Measurement-based Analysis of Networked System Availability
Performance Evaluation: Origins and Directions
Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data
Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures
An Experimental Study of Security Vulnerabilities Caused by Errors
DSN '01 Proceedings of the 2001 International Conference on Dependable Systems and Networks (formerly: FTCS)
Failure Data Analysis of a LAN of Windows NT Based Computers
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems
ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
Evaluation of Software Dependability Based on Stability Test Data
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Proactive management of software aging
IBM Journal of Research and Development
EVEREST+: run-time SLA violations prediction
Proceedings of the 5th International Workshop on Middleware for Service Oriented Computing
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
Performance implications of failures in large-scale cluster scheduling
JSSPP'04 Proceedings of the 10th international conference on Job Scheduling Strategies for Parallel Processing
Modeling machine availability in enterprise and wide-area distributed computing environments
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Wi-Fi/ZigBee coexistence for fault-tolerant building automation system
International Journal of Systems, Control and Communications
Hi-index | 0.00 |
This paper presents an analysis of operating system failures on an IBM 3081 running VM/SP. We find three broad categories of software failures: error handling (ERH), program control or logic (CTL), and hardware related (HS); it is found that more than 25 percent of software failures occur in the hardware/software interface. Measurements show that results on software reliability cannot be considered representative unless the system workload is taken into account. For example, it is shown that the risk of a software failure increases in a nonlinear fashion with the amount of interactive processing, as measured by parameters such as the paging rate and the amount of overhead (operating system CPU time). The overall CPU execution rate, although measured to be close to 100 percent most of the time, is not found to correlate strongly with the occurrence of failures. The paper discusses possible reasons for the observed workload failure dependency based on detailed investigations of the failure data.