Analysis of a composite performance reliability measure for fault-tolerant systems

Authors:
Lorenzo Donatiello;Balakrishna R. Iyer
Affiliations:
IBM Thomas J. Watson Research Center, Yorktown Heights, NY;IBM Thomas J. Watson Research Center, Yorktown Heights, NY
Venue:
Journal of the ACM (JACM)
Year:
1987

Abstract

Today's concomitant needs for higher computing power and reliability have increased the relevance of multiple-processor fault-tolerant systems. Multiple functional units improve the raw performance (throughput, response time, etc.) of the system, and, as units fail, the system may continue to function albeit with degraded performance. Such systems and other fault-tolerant systems are not adequately characterized by separate performance and reliability measures. A composite measure for the performance and reliability of a fault-tolerant system observed over a finite mission time is analyzed. A Markov chain model is used for system state-space representation, and transient analysis is performed to obtain closed-form solutions for the density and moments of the composite measure. Only failures that cannot be repaired until the end of the mission are modeled. The time spent in a specific system configuration is assumed to be large enough to permit the use of a hierarchical model and static measures to quantify the performance of the system in individual configurations. For a multiple-processor system, where performance measures are usually associated with and aggregated over many jobs, this is tantamount to assuming that the time to process a job is much smaller than the time between failures. An extension of the results to general acyclic Markov chain models is included.
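
As an illustration of the kind of quantity the abstract describes, the following minimal sketch computes only the mean of an accumulated-reward measure over a finite mission time for a non-repairable multiple-processor system. It is not the closed-form method of the paper. It assumes (these assumptions are not stated in the abstract) that the processors fail independently with a common exponential rate, so that the number of working processors at time t is binomially distributed, and that each configuration has a fixed, hypothetical reward rate such as throughput. The function names, the reward values, and the use of Simpson's rule are all illustrative choices.

```python
import math


def state_probabilities(n, failure_rate, t):
    """Probability that k of n processors are still working at time t, for k = 0..n.

    Assumes independent, identically distributed exponential lifetimes and no repair.
    """
    p = math.exp(-failure_rate * t)  # survival probability of a single processor
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def expected_reward_rate(rewards, failure_rate, t):
    """Expected instantaneous reward at time t.

    rewards[k] is the (hypothetical) static performance level with k working processors.
    """
    n = len(rewards) - 1
    probs = state_probabilities(n, failure_rate, t)
    return sum(r * q for r, q in zip(rewards, probs))


def expected_accumulated_reward(rewards, failure_rate, mission_time, steps=1000):
    """Mean reward accumulated over [0, mission_time], via composite Simpson's rule."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = mission_time / steps
    total = expected_reward_rate(rewards, failure_rate, 0.0)
    total += expected_reward_rate(rewards, failure_rate, mission_time)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * expected_reward_rate(rewards, failure_rate, i * h)
    return total * h / 3


# Hypothetical example: three processors with sublinear throughput scaling.
rewards = [0.0, 1.0, 1.8, 2.4]
print(expected_accumulated_reward(rewards, failure_rate=0.01, mission_time=100.0))
```

When the reward rate is linear in the number of working processors, the mean has the simple form n(1 - e^{-λT})/λ, which gives a convenient check on the numerical integration. Obtaining the full density or higher moments, as the abstract describes, requires more than this sketch provides.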