Fault-tolerant grid services using primary-backup: feasibility and performance

  • Authors:
  • Xianan Zhang; D. Zagorodnov; M. Hiltunen; K. Marzullo; R. D. Schlichting

  • Affiliations:
  • California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA

  • Venue:
  • CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
  • Year:
  • 2004

Abstract

The combination of grid technology and Web services has produced an attractive platform for deploying distributed applications: grid services, as represented by the Open Grid Services Infrastructure (OGSI) and its Globus toolkit implementation. As the use of grid services grows in popularity, tolerating failures becomes increasingly important. This work addresses the problem of building a reliable and highly available grid service by replicating the service on two or more hosts using the primary-backup approach. The primary goal is to evaluate the ease and efficiency with which this can be done, by first designing a primary-backup protocol using OGSI, and then implementing it using Globus to evaluate performance implications and tradeoffs. We compared three implementations: one that makes heavy use of the notification interface defined in OGSI, one that uses standard grid service requests instead of notification, and one that uses low-level socket primitives. The overall conclusion is that, while the performance penalty of using Globus primitives (especially notification) for replica coordination can be significant, the OGSI model is suitable for building highly available services and it makes the task of engineering such services easier.
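
For readers unfamiliar with the primary-backup approach named in the abstract, the following is a minimal in-process sketch of the general idea, not the paper's protocol. It does not use OGSI or Globus, and every name in it (`Replica`, `PrimaryBackupService`, `request`) is hypothetical. The primary applies each request, forwards the same update to the live backups before replying, and a backup takes over when the primary is marked as failed. Here the forwarding step is a plain method call; in the paper's three implementations that step would instead be carried by OGSI notification, standard grid service requests, or sockets.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a hypothetical counter service (illustrative only)."""
    name: str
    state: int = 0
    alive: bool = True
    applied: set = field(default_factory=set)  # request ids already applied

    def apply(self, request_id: int, amount: int) -> int:
        # Idempotent: each request id changes the state at most once,
        # so a client retry after failover does not double-count.
        if request_id not in self.applied:
            self.state += amount
            self.applied.add(request_id)
        return self.state


class PrimaryBackupService:
    """Routes requests to the primary and keeps backups in sync."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # index 0 is the current primary

    def primary(self) -> Replica:
        # Failover: discard crashed replicas at the front of the list,
        # which promotes the next live backup to primary.
        while self.replicas and not self.replicas[0].alive:
            self.replicas.pop(0)
        if not self.replicas:
            raise RuntimeError("no live replica")
        return self.replicas[0]

    def request(self, request_id: int, amount: int) -> int:
        primary = self.primary()
        result = primary.apply(request_id, amount)
        # Forward the update to every live backup before replying.
        for backup in self.replicas[1:]:
            if backup.alive:
                backup.apply(request_id, amount)
        return result
```

A short usage example under the same assumptions: create two replicas, send a request, mark the primary as crashed, and send another request. The second request is served by the promoted backup, which already holds the state from the first.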