A robust failure recovery mechanism that handles diverse failures in heterogeneous, dynamic Grid environments is essential to ensuring that long-running applications execute to completion. Although various efforts have addressed this issue, existing solutions either employ a single fault-tolerant technique without considering the diversity of failures, or propose frameworks that cannot adaptively handle the various kinds of failures that occur in a Grid. This paper proposes an adaptive, task-level fault-tolerant approach for the Grid. The approach aims to handle a fairly complete set of the failures that arise in Grid environments by integrating basic fault-tolerant techniques. The paper also argues that resource consumption, which has not received enough attention, is an important evaluation metric for any fault-tolerant approach. Corresponding evaluation models based on mean execution time and resource consumption are constructed for evaluating fault-tolerant approaches. Using these models, we demonstrate the effectiveness of our approach and illustrate the performance gains through simulations. Experiments on a real Grid show that our approach achieves better performance and consumes fewer resources.