Large-scale distributed systems such as Overnet, BOINC (SETI@home), and PlanetLab offer attractive options by aggregating and sharing heterogeneous, geographically dispersed computing resources. To be efficient, however, these systems must account for the reliability and availability (R&A) levels of their nodes and of the services they offer. They typically operate in extremely dynamic and heterogeneous environments, in which nodes with different computing capabilities and features can join and leave freely. This dynamism and heterogeneity introduce uncertainty and make it difficult to develop accurate models that predict how R&A levels evolve over time in distributed environments. This paper reviews some R&A issues in large-scale distributed systems and studies how they relate to the quality of service offered to users. It also discusses the role of simulation as the most natural way to deal with these issues and introduces a simulation-based methodology for designing reliable and cost-efficient distributed services.