A resource management and fault tolerance services in grid computing

Authors:
HwaMin Lee;KwangSik Chung;SungHo Chin;JongHyuk Lee;DaeWon Lee;Seongbin Park;HeonChang Yu
Affiliations:
Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science, Korea National Open University, 169, Dongsung-Dong, Chongno-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea
Venue:
Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part II
Year:
2005

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor in determining computing performance. In this paper, we propose a resource manager for optimal resource selection. Using a genetic algorithm, our resource manager automatically selects, from the candidate resources, the set of resources that achieves optimal performance. The probability of failure is typically higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. A fault tolerance service is therefore essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failure in distributed systems in order to provide a fault tolerance service that handles various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and, through job migration, improves job execution performance even when some failures occur.
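
As a rough illustration of genetic-algorithm resource selection of the kind the abstract mentions, the sketch below evolves fixed-size subsets of candidate resources. It is not the paper's algorithm: the abstract does not give the encoding, fitness function, or operators, so the cost model `estimated_makespan`, the inputs `speeds` and `failure_rates`, and every parameter value are assumptions made only for this example.

```python
import random


def estimated_makespan(selection, speeds, failure_rates, work=1000.0):
    """Hypothetical cost model (an assumption, not taken from the paper).

    The work is split evenly over the selected resources; the slowest
    one, penalized by its failure rate, determines the finish time.
    """
    share = work / len(selection)
    return max(share / speeds[i] * (1.0 + failure_rates[i]) for i in selection)


def select_resources(speeds, failure_rates, k,
                     generations=50, pop_size=30, seed=0):
    """Evolve k-element subsets of resource indices toward lower cost."""
    rng = random.Random(seed)
    n = len(speeds)

    def cost(selection):
        return estimated_makespan(selection, speeds, failure_rates)

    # Each individual is a sorted tuple of k distinct resource indices.
    population = [tuple(sorted(rng.sample(range(n), k)))
                  for _ in range(pop_size)]

    for _ in range(generations):
        # Elitist selection: the better half survives unchanged.
        population.sort(key=cost)
        survivors = population[: pop_size // 2]

        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw k indices from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = set(rng.sample(pool, k))
            # Mutation: occasionally swap one index for an unused one.
            if rng.random() < 0.2:
                outside = [i for i in range(n) if i not in child]
                if outside:
                    child.remove(rng.choice(sorted(child)))
                    child.add(rng.choice(outside))
            children.append(tuple(sorted(child)))

        population = survivors + children

    return min(population, key=cost)
```

Calling `select_resources(speeds, failure_rates, k=3)` returns a sorted tuple of three resource indices. Because the better half of each generation is carried over unchanged, the best cost never gets worse from one generation to the next, but a heuristic like this does not by itself guarantee the true optimum.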