Reliability and availability issues in large-scale distributed systems

  • Authors:
  • Angel A. Juan;Joan M. Marquès;Dragos Ionescu;Javier Faulin

  • Affiliations:
  • Open University of Catalonia, Barcelona, Spain;Open University of Catalonia, Barcelona, Spain;Massachusetts Institute of Technology, Cambridge, MA;Public University of Navarre, Pamplona, Spain

  • Venue:
  • Proceedings of the Winter Simulation Conference
  • Year:
  • 2010

Abstract

Large-scale distributed systems, such as Overnet, BOINC (SETI@home) or PlanetLab, provide attractive options through the aggregation and sharing of heterogeneous and geographically dispersed computer resources. However, to be efficient, these systems need to account for the Reliability and Availability (R&A) levels of their nodes and of the services they offer. They are usually characterized by extremely dynamic and heterogeneous environments, in which nodes offering different computing capabilities and features can enter and leave freely. This dynamism and heterogeneity introduce uncertainty and make it difficult to develop accurate models for predicting the temporal evolution of R&A levels in distributed environments. This paper reviews some R&A issues in large-scale distributed systems and studies how they relate to the quality of service offered to users. The paper also discusses the role of simulation as the most natural way to deal with these issues and introduces a simulation-based methodology for designing reliable and cost-efficient distributed services.
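
To make the idea of simulating availability concrete, the following is a minimal sketch and not the methodology introduced in the paper. It assumes that each node alternates between up and down periods with exponentially distributed durations, and that a service counts as available whenever at least k of its nodes are up. The function name, parameters, and the exponential assumption are all hypothetical choices made for illustration.

```python
import random


def simulate_service_availability(mttf, mttr, k, horizon, seed=0):
    """Estimate the fraction of [0, horizon] in which at least k nodes are up.

    mttf[i] and mttr[i] are the assumed mean time to failure and mean time
    to repair of node i. Exponential durations are an illustrative
    assumption, not something stated in the abstract.
    """
    rng = random.Random(seed)
    n = len(mttf)
    events = []  # (time, delta): delta is -1 for a failure, +1 for a repair

    # Generate the alternating up/down history of each node independently.
    for i in range(n):
        t = 0.0
        up = True  # every node is assumed to start in the up state
        while True:
            mean = mttf[i] if up else mttr[i]
            t += rng.expovariate(1.0 / mean)
            if t >= horizon:
                break
            events.append((t, -1 if up else +1))
            up = not up

    # Sweep the merged event list, accumulating time with >= k nodes up.
    events.sort()
    up_count = n
    last_t = 0.0
    available_time = 0.0
    for t, delta in events:
        if up_count >= k:
            available_time += t - last_t
        up_count += delta
        last_t = t
    if up_count >= k:
        available_time += horizon - last_t
    return available_time / horizon


if __name__ == "__main__":
    # Three hypothetical nodes with different failure and repair behavior.
    mttf = [100.0, 50.0, 20.0]
    mttr = [10.0, 10.0, 5.0]
    for k in (1, 2, 3):
        a = simulate_service_availability(mttf, mttr, k, horizon=100_000.0)
        print(f"at least {k} of 3 nodes up: estimated availability {a:.4f}")
```

Varying k, the number of nodes, or the per-node parameters in such a sketch shows how redundancy trades cost against service availability, which is the kind of design question a simulation-based approach is meant to support.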