Dependability Measurement and Modeling of a Multicomputer System

Authors:
D. Tang; R. K. Iyer
Venue:
IEEE Transactions on Computers
Year:
1993

Citing 6
Cited 21

Measurement and modeling of computer reliability as affected by system activity

ACM Transactions on Computer Systems (TOCS)
Performability Analysis: Measures, an Algorithm, and a Case Study

IEEE Transactions on Computers - Fault-Tolerant Computing
Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data

IEEE Transactions on Computers
Availability and reliability modeling for computer systems

Advances in computers
Probability and Statistics with Reliability, Queuing and Computer Science Applications

Probability and Statistics with Reliability, Queuing and Computer Science Applications
Measurement-based reliability/performability models

Measurement-based reliability/performability models

Analysis and Modeling of Correlated Failures in Multicomputer Systems

IEEE Transactions on Computers - Special issue on fault-tolerant computing
MEASURE+: a measurement-based dependability analysis package

SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
Analysis of Checkpointing Schemes with Task Duplication

IEEE Transactions on Computers
Analysis and implementation of software rejuvenation in cluster systems

Proceedings of the 2001 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Measurement-based Analysis of Networked System Availability

Performance Evaluation: Origins and Directions
Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data

Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures
A comparative analysis of event tupling schemes

FTCS '96 Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Failure Data Analysis of a LAN of Windows NT Based Computers

SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Analyze-NOW - an environment for collection and analysis of failures in a network of workstations

ISSRE '96 Proceedings of the Seventh International Symposium on Software Reliability Engineering
A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems

ISSRE '99 Proceedings of the 10th International Symposium on Software Reliability Engineering
VAX/VMS Event Monitoring and Analysis

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
A performance model of highly available multicomputer systems

International Journal of Modelling and Simulation
On the dynamic resource availability in grids

GRID '07 Proceedings of the 8th IEEE/ACM International Conference on Grid Computing
An outlook on design technologies for future integrated systems

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Proactive management of software aging

IBM Journal of Research and Development
A survey of online failure prediction methods

ACM Computing Surveys (CSUR)
The study of software reliability evaluation and testing method

IITA'09 Proceedings of the 3rd international conference on Intelligent information technology application
A model for space-correlated failures in large-scale distributed systems

EuroPar'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part I
Experimental evaluation

FTCS'95 Proceedings of the Twenty-Fifth international conference on Fault-tolerant computing
Model based approach for autonomic availability management

ISAS'06 Proceedings of the Third international conference on Service Availability
Study on application server aging prediction based on wavelet network with hybrid genetic algorithm

ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications

Quantified Score

Hi-index	14.98

Abstract

A measurement-based analysis of error data collected from a DEC VAXcluster multicomputer system is presented. Basic system dependability characteristics, such as error/failure distributions and hazard rates, are obtained for both the individual machines and the entire VAXcluster. Markov reward models are developed to analyze error/failure behavior and to evaluate performance loss due to errors/failures. Correlation analysis is then performed to quantify relationships of errors/failures across machines and across time. It is found that shared resources constitute a major reliability bottleneck. It is shown that, for the measured system, the homogeneous Markov model, which assumes constant failure rates, overestimates the transient reward rate for short-term operation and underestimates it for long-term operation. Correlation analysis shows that errors are highly correlated across machines and across time. Although the failure correlation coefficient is low, its effect on system unavailability is significant.