IEEE Transactions on Software Engineering
Presents a modeling approach based on stochastic Petri nets to estimate the reliability and availability of programs in a distributed computing system. In this environment, a program executes successfully only if it can access the related files distributed throughout the system. The use of stochastic Petri nets is demonstrated by extending a basic reliability model to account for repair actions when faults occur. Two repair models are discussed: the global repair model, which assumes a centralized repair team that restores the system to its original status once a failure state is reached, and the local repair model, which assumes that each repair is performed at the node where the fault occurred. The global repair model is useful for evaluating the availability of programs (or of the supporting hardware) subject to hardware faults that are repaired globally, so the programs of interest can be interrupted. The local repair model, by contrast, can be used to evaluate program reliability in the presence of hardware faults that are repaired without interrupting the normal operation of the system.
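The difference between the two repair models can be illustrated with a small numerical sketch. It is not the paper's stochastic Petri net model: it is a hand-built continuous-time Markov chain for a hypothetical system in which a program needs one file replicated on two nodes and can run while at least one node is up. The failure rate `lam`, the repair rates `mu_g` and `mu`, and both function names are assumptions made for illustration only. Under global repair the failure state is left through a single repair transition back to the initial state, so the quantity of interest is steady-state availability. Under local repair the failure state is treated as absorbing, so the quantity of interest is the reliability at time `t`.

```python
import math


def global_repair_availability(lam, mu_g):
    """Steady-state availability under the (assumed) global repair model.

    States count the nodes that are up: 2 -> 1 -> 0 -> 2.
    The chain is a pure cycle, so each steady-state probability is
    proportional to the mean sojourn time in that state.
    """
    t2 = 1.0 / (2.0 * lam)  # both nodes up, either one may fail
    t1 = 1.0 / lam          # one node up
    t0 = 1.0 / mu_g         # system failed, waiting for global repair
    return (t2 + t1) / (t2 + t1 + t0)


def local_repair_reliability(lam, mu, t, steps=10000):
    """Probability that the failure state has not been reached by time t.

    Transitions: 2 -> 1 at rate 2*lam, 1 -> 2 at rate mu (local repair),
    1 -> 0 at rate lam. State 0 is absorbing. The two transient state
    probabilities are integrated with a fixed-step fourth-order
    Runge-Kutta scheme.
    """
    def deriv(p2, p1):
        d2 = -2.0 * lam * p2 + mu * p1
        d1 = 2.0 * lam * p2 - (lam + mu) * p1
        return d2, d1

    p2, p1 = 1.0, 0.0  # start with both nodes up
    h = t / steps
    for _ in range(steps):
        a2, a1 = deriv(p2, p1)
        b2, b1 = deriv(p2 + 0.5 * h * a2, p1 + 0.5 * h * a1)
        c2, c1 = deriv(p2 + 0.5 * h * b2, p1 + 0.5 * h * b1)
        d2, d1 = deriv(p2 + h * c2, p1 + h * c1)
        p2 += h * (a2 + 2.0 * b2 + 2.0 * c2 + d2) / 6.0
        p1 += h * (a1 + 2.0 * b1 + 2.0 * c1 + d1) / 6.0
    return p2 + p1


if __name__ == "__main__":
    lam = 0.001  # assumed node failure rate (per hour)
    print("availability, global repair:", global_repair_availability(lam, mu_g=0.1))
    print("reliability at t=1000, no repair:", local_repair_reliability(lam, 0.0, 1000.0))
    print("reliability at t=1000, local repair:", local_repair_reliability(lam, 0.1, 1000.0))
```

With `mu = 0` the local repair chain reduces to two independent nodes in parallel, whose reliability is `2*exp(-lam*t) - exp(-2*lam*t)`, which gives a convenient sanity check on the integrator. A real stochastic Petri net tool would generate a much larger chain of this kind automatically from the net's reachability graph.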