Fault-tolerant grid architecture and practice

Authors:
Hai Jin;DeQing Zou;HanHua Chen;JianHua Sun;Song Wu
Affiliations:
Huazhong University of Science and Technology, Wuhan 430074, P.R. China (all authors)
Venue:
Journal of Computer Science and Technology - Grid computing
Year:
2003

Abstract

Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems over wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems, and unreliable fault detection is one of the most effective. Globus, a grid middleware, manages resources across a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. This paper presents a fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service. The platform offers effective strategies in three areas: grid key components, user tasks, and high-level applications.
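
The abstract does not describe how the detector works, so the following is only a minimal sketch of the general idea behind a heartbeat-based unreliable failure detector, not the FTGP or Globus implementation. The class name, method names, timeout values, and the component labels "gatekeeper" and "scheduler" are all assumptions made for illustration. A component is suspected when its heartbeat is overdue; if a heartbeat arrives later, the suspicion is withdrawn and the timeout for that component is lengthened, which is why the detector is called unreliable: it can be wrong, but it corrects itself.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Timeout-based unreliable failure detector (all names are hypothetical)."""
    initial_timeout: float = 2.0  # assumed: seconds of silence before suspicion
    backoff: float = 1.0          # assumed: extra patience after a false suspicion
    last_seen: dict = field(default_factory=dict)
    timeout: dict = field(default_factory=dict)
    suspected: set = field(default_factory=set)

    def heartbeat(self, component: str, now: float) -> None:
        """Record an 'I am alive' message from a component."""
        if component in self.suspected:
            # The earlier suspicion was a mistake: retract it and wait longer next time.
            self.suspected.discard(component)
            self.timeout[component] += self.backoff
        self.timeout.setdefault(component, self.initial_timeout)
        self.last_seen[component] = now

    def check(self, now: float) -> set:
        """Suspect every component whose heartbeat is overdue; return all suspects."""
        for component, seen in self.last_seen.items():
            if now - seen > self.timeout[component]:
                self.suspected.add(component)
        return set(self.suspected)


detector = HeartbeatDetector()
detector.heartbeat("gatekeeper", now=0.0)
detector.heartbeat("scheduler", now=0.0)
detector.heartbeat("scheduler", now=2.0)
first = detector.check(now=3.0)             # gatekeeper has been silent for 3.0 s
detector.heartbeat("gatekeeper", now=3.5)   # late heartbeat retracts the suspicion
second = detector.check(now=4.0)
print(first, second)
```

Time is passed in explicitly rather than read from a clock so that the behaviour is easy to trace by hand. A real grid service would also have to deliver the suspicion reports to interested parties, which this sketch leaves out.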