Survey: Survey of fault tolerant techniques for grid

Authors:
S. Siva Sathya; K. Syam Babu
Affiliations:
Ramanujan School of Mathematics & Computer Science, Pondicherry University, Pondicherry-605014, India (both authors)
Venue:
Computer Science Review
Year:
2010

Abstract

Grids are dynamic: resources may enter and leave the grid at any time, in many cases outside the applications' control. Grid resources are also heterogeneous, so many grid applications run in environments where interaction faults between disparate grid nodes are more likely to occur. Because resources may also be used outside organizational boundaries, it becomes increasingly difficult to guarantee that a resource in use is not malicious. Given these diverse faults and failure conditions, developing, deploying, and executing long-running applications over the grid remains a challenge, and fault tolerance is therefore essential for grid computing. This paper presents an extensive survey of fault-tolerant techniques: replication strategies, checkpointing mechanisms, scheduling policies, failure detection mechanisms, and malleability and migration support for divide-and-conquer applications. The choice of technique depends on the needs of the computational grid and on the type of environment, resources, virtual organizations, and job profile it is expected to work with. Each technique has its own merits and demerits, which form the subject matter of this survey.