An adaptive task-level fault-tolerant approach to Grid

Authors:
Yongwei Wu;Yulai Yuan;Guangwen Yang;Weimin Zheng
Affiliations:
Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing, People's Republic of China 100084
Venue:
The Journal of Supercomputing
Year:
2010

Abstract

A robust failure recovery mechanism that handles diverse failures in a heterogeneous and dynamic Grid is essential to ensure that long-running applications execute to completion. Although various efforts have been made to address this issue, existing solutions either employ a single fault-tolerant technique without considering the diversity of failures, or propose frameworks that cannot adaptively handle the various kinds of failures in a Grid. In this paper, an adaptive task-level fault-tolerant approach to Grid is proposed. The approach aims to handle a fairly complete set of failures arising in Grid environments by integrating basic fault-tolerant approaches. Moreover, this paper argues that resource consumption, which has not received enough attention, is also an important evaluation metric for any fault-tolerant approach. Corresponding evaluation models based on mean execution time and resource consumption are constructed to evaluate fault-tolerant approaches. Based on these models, we demonstrate the effectiveness of our approach and illustrate the performance gains achieved via simulations. Experiments on a real Grid were also conducted, and the results show that our approach achieves better performance and consumes fewer resources.