Reliability and availability issues in large-scale distributed systems

Authors:
Angel A. Juan;Joan M. Marquès;Dragos Ionescu;Javier Faulin
Affiliations:
Open University of Catalonia, Barcelona, Spain;Open University of Catalonia, Barcelona, Spain;Massachusetts Institute of Technology, Cambridge, MA;Public University of Navarre, Pamplona, Spain
Venue:
Proceedings of the Winter Simulation Conference
Year:
2010

Abstract

Large-scale distributed systems, such as Overnet, BOINC (SETI@home), or PlanetLab, provide attractive options through the aggregation and sharing of heterogeneous, geographically dispersed computing resources. To be efficient, however, these systems must account for the Reliability and Availability (R&A) levels of their nodes and of the services they offer. They are usually characterized by extremely dynamic and heterogeneous environments, in which nodes with different computing capabilities and features can join and leave freely. This dynamism and heterogeneity introduce uncertainty and make it difficult to develop accurate models for predicting how R&A levels evolve over time in distributed environments. This paper reviews some R&A issues in large-scale distributed systems and studies how they relate to the quality of service offered to users. It also discusses the role of simulation as the most natural way to deal with these issues and introduces a simulation-based methodology for designing reliable and cost-efficient distributed services.