Fault-tolerant grid services using primary-backup: feasibility and performance

Authors:
Xianan Zhang;D. Zagorodnov;M. Hiltunen;K. Marzullo;R. D. Schlichting
Affiliations:
California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA;California Univ., San Diego, CA, USA
Venue:
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Year:
2004

Citing 0
Cited 13

FTWeb: A Fault Tolerant Infrastructure for Web Services
EDOC '05 Proceedings of the Ninth IEEE International EDOC Enterprise Computing Conference

On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices
Journal of Parallel and Distributed Computing

Automatic replication of WSRF-based Grid services via operation providers
Future Generation Computer Systems

Supporting fault-tolerance for time-critical events in distributed environments
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

The reliability analysis of resiliency framework for Grid Services
ACST '08 Proceedings of the Fourth IASTED International Conference on Advances in Computer Science and Technology

Supporting fault-tolerance for time-critical events in distributed environments
Scientific Programming

Architecture-based fault tolerance support for grid applications
Proceedings of the joint ACM SIGSOFT conference -- QoSA and ACM SIGSOFT symposium -- ISARCS on Quality of software architectures -- QoSA and architecting critical systems -- ISARCS

Using agreement services in grid computing
ISPA'06 Proceedings of the 2006 international conference on Frontiers of High Performance Computing and Networking

Fault-Tolerant scheduling for bag-of-tasks grid applications
EGC'05 Proceedings of the 2005 European conference on Advances in Grid Computing

Framework for enabling highly available distributed applications for utility computing
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications

Quality-of-service-aware fault tolerance for grid-enabled applications
Optical Switching and Networking

Time-constrained high-fidelity rendering on local desktop grids
EG PGV'09 Proceedings of the 9th Eurographics conference on Parallel Graphics and Visualization

A survey on reliability in distributed systems
Journal of Computer and System Sciences

Abstract

The combination of grid technology and Web services has produced an attractive platform for deploying distributed applications: grid services, as represented by the Open Grid Services Infrastructure (OGSI) and its Globus Toolkit implementation. As grid services grow in popularity, tolerating failures becomes increasingly important. This work addresses the problem of building a reliable and highly available grid service by replicating the service on two or more hosts using the primary-backup approach. The primary goal is to evaluate the ease and efficiency with which this can be done, by first designing a primary-backup protocol using OGSI and then implementing it using Globus to evaluate performance implications and tradeoffs. We compared three implementations: one that makes heavy use of the notification interface defined in OGSI, one that uses standard grid service requests instead of notification, and one that uses low-level socket primitives. The overall conclusion is that, while the performance penalty of using Globus primitives for replica coordination, especially notification, can be significant, the OGSI model is suitable for building highly available services and makes the task of engineering such services easier.
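
To make the primary-backup approach concrete, the following is a minimal in-process sketch, not the paper's protocol. It assumes a toy counter service, and the names Replica, PrimaryBackupService, handle, and request_id are hypothetical. Replica coordination is an ordinary method call here, standing in for the notification, grid service request, or socket transports that the abstract compares. No OGSI or Globus API is used.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a hypothetical counter service (not an OGSI or Globus API)."""
    name: str
    state: int = 0
    alive: bool = True
    applied: set = field(default_factory=set)  # request ids already applied

    def apply(self, request_id: str, delta: int) -> int:
        # Idempotent update: a retried request is not applied twice.
        if request_id not in self.applied:
            self.state += delta
            self.applied.add(request_id)
        return self.state


class PrimaryBackupService:
    """Routes requests to the primary and copies each update to the backups."""

    def __init__(self, replicas):
        self.replicas = list(replicas)  # replicas[0] acts as the primary

    def _failover(self):
        # Drop crashed replicas; the first survivor becomes the primary.
        self.replicas = [r for r in self.replicas if r.alive]
        if not self.replicas:
            raise RuntimeError("no live replicas")

    def handle(self, request_id: str, delta: int) -> int:
        self._failover()
        primary, backups = self.replicas[0], self.replicas[1:]
        result = primary.apply(request_id, delta)
        # Propagate the update to every backup before replying to the client.
        for backup in backups:
            backup.apply(request_id, delta)
        return result


service = PrimaryBackupService([Replica("host-a"), Replica("host-b")])
service.handle("req-1", 5)
service.replicas[0].alive = False  # simulate a crash of the primary
service.handle("req-2", 3)         # served by the promoted backup
```

In this sketch the backup takes over with the state it received before the crash, and the request identifier lets a client retry safely after a failover. A real deployment would also need failure detection and a way to reintegrate recovered replicas, which are omitted here.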