On scalable and efficient distributed failure detectors
Proceedings of the twentieth annual ACM symposium on Principles of distributed computing
Heartbeat: A Timeout-Free Failure Detector for Quiescent Reliable Communication
WDAG '97 Proceedings of the 11th International Workshop on Distributed Algorithms
Failure Detectors as First Class Objects
DOA '99 Proceedings of the International Symposium on Distributed Objects and Applications
An Enabling Framework for Master-Worker Applications on the Computational Grid
HPDC '00 Proceedings of the 9th IEEE International Symposium on High Performance Distributed Computing
Error Scope on a Computational Grid: Theory and Practice
HPDC '02 Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing
NCA '04 Proceedings of the Network Computing and Applications, Third IEEE International Symposium
Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware
MGC '04 Proceedings of the 2nd workshop on Middleware for grid computing
Failure Detection and Membership Management in Grid Environments
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Checkpoint and Restart for Distributed Components in XCAT3
GRID '04 Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
An architecture for checkpointing and migration of distributed components on the grid
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A planning-based approach to failure recovery in distributed systems
Adaptive and reliable parallel computing on networks of workstations
ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
Recent advances in checkpoint/recovery systems
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
A fault-tolerant scheduling system for computational grids
Computers and Electrical Engineering
Performance analysis of replication mechanism using mobile agent in computational grid using WADE
International Journal of Information and Communication Technology
Behavioral modeling and formal verification of a resource discovery approach in Grid computing
Expert Systems with Applications: An International Journal
Grids are dynamic: resources may enter and leave at any time, in many cases outside the applications' control. Grid resources are also heterogeneous, and many grid applications run in environments where interaction faults between disparate grid nodes are more likely to occur. Because resources may also be used outside organizational boundaries, it becomes increasingly difficult to guarantee that a resource in use is not malicious. Given these diverse faults and failure conditions, developing, deploying, and executing long-running applications on the grid remains a challenge, and fault tolerance is therefore an essential factor in grid computing. This paper presents an extensive survey of fault-tolerance techniques: replication strategies, checkpointing mechanisms, scheduling policies, failure detection mechanisms, and malleability and migration support for divide-and-conquer applications. Which technique is used depends on the needs of the computational grid and on the type of environment, resources, virtual organizations, and job profile it is expected to work with. Each technique has its own merits and demerits, which form the subject matter of this survey.