Measurement and modeling of computer reliability as affected by system activity
ACM Transactions on Computer Systems (TOCS)
Performability Analysis: Measures, an Algorithm, and a Case Study
IEEE Transactions on Computers - Fault-Tolerant Computing
Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
IEEE Transactions on Computers
Availability and reliability modeling for computer systems
Advances in computers
Probability and Statistics with Reliability, Queuing and Computer Science Applications
Probability and Statistics with Reliability, Queuing and Computer Science Applications
Measurement-based reliability/performability models
Measurement-based reliability/performability models
Analysis and Modeling of Correlated Failures in Multicomputer Systems
IEEE Transactions on Computers - Special issue on fault-tolerant computing
MEASURE+: a measurement-based dependability analysis package
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Analysis of Checkpointing Schemes with Task Duplication
IEEE Transactions on Computers
Analysis and implementation of software rejuvenation in cluster systems
Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Measurement-based Analysis of Networked System Availability
Performance Evaluation: Origins and Directions
Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data
Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures
A comparative analysis of event tupling schemes
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Failure Data Analysis of a LAN of Windows NT Based Computers
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Analyze-NOW-an environment for collection and analysis of failures in a network of workstations
ISSRE '96 Proceedings of the The Seventh International Symposium on Software Reliability Engineering
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems
ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
VAX/VMS Event Monitoring and Analysis
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
A performance model of highly available multicomputer systems
International Journal of Modelling and Simulation
On the dynamic resource availability in grids
GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
An outlook on design technologies for future integrated systems
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Proactive management of software aging
IBM Journal of Research and Development
A survey of online failure prediction methods
ACM Computing Surveys (CSUR)
The study of software reliability evaluation and testing method
IITA'09 Proceedings of the 3rd international conference on Intelligent information technology application
A model for space-correlated failures in large-scale distributed systems
EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
Model based approach for autonomic availability management
ISAS'06 Proceedings of the Third international conference on Service Availability
Study on application server aging prediction based on wavelet network with hybrid genetic algorithm
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Hi-index | 14.98 |
A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics such as error/failure distributions and hazard rate are obtained for both the individual machine and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of error/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that for measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for the short-term operation, and underestimates it for the long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. The failure correlation coefficient is low. However, its effect on system unavailability is significant.