A resource management and fault tolerance services in grid computing

  • Authors:
  • HwaMin Lee;KwangSik Chung;SungHo Chin;JongHyuk Lee;DaeWon Lee;Seongbin Park;HeonChang Yu

  • Affiliations:
  • Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science, Korea National Open University, 169, Dongsung-Dong, Chongno-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea;Department of Computer Science Education, Korea University, 1, 5-Ka, Anam-Dong, Sungbuk-Ku, Seoul, Korea

  • Venue:
  • Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part II
  • Year:
  • 2005

Quantified Score

Hi-index 0.01

Abstract

In grid computing, resource management and fault tolerance services are important issues. The availability of the resources selected for job execution is a primary factor that determines computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources can be fatal to job execution. Therefore, a fault tolerance service is essential in computational grids. In addition, grid services are often expected to meet some minimum level of Quality of Service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and that job execution performance improves through job migration even if some failures occur.
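
To make the idea of genetic-algorithm-based resource selection concrete, the following is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the chromosome encoding, the fitness function, or the genetic operators. The bit-mask encoding, the cost model in `completion_time` (work divided by availability-weighted capacity, plus a hypothetical per-resource coordination `overhead`), and all parameter values are assumptions chosen only for illustration.

```python
import random


def completion_time(mask, speeds, avail, work, overhead):
    """Estimated job completion time for a subset of resources (lower is better).

    Assumed cost model, not taken from the paper: each selected resource
    contributes speed * availability to the total capacity, and every
    selected resource adds a fixed coordination overhead.
    """
    capacity = sum(s * a for bit, s, a in zip(mask, speeds, avail) if bit)
    if capacity == 0:
        return float("inf")  # an empty or fully unavailable selection cannot run the job
    return work / capacity + overhead * sum(mask)


def select_resources(speeds, avail, work=100.0, overhead=1.0,
                     pop_size=30, generations=60, mutation_rate=0.05, seed=0):
    """Search for a low-cost resource subset with a simple genetic algorithm.

    Each chromosome is a list of 0/1 flags, one per candidate resource.
    Requires at least two candidate resources and pop_size >= 3.
    """
    rng = random.Random(seed)
    n = len(speeds)

    def cost(mask):
        return completion_time(mask, speeds, avail, work, overhead)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry the best individual forward unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: the better of three random individuals, twice.
            p1 = min(rng.sample(population, 3), key=cost)
            p2 = min(rng.sample(population, 3), key=cost)
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            next_gen.append(child)
        population = next_gen
        best = min(population, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    speeds = [10.0, 1.0, 8.0, 0.5, 6.0]   # hypothetical relative CPU speeds
    avail = [0.9, 0.5, 0.95, 0.2, 0.1]    # hypothetical availability in [0, 1]
    mask, t = select_resources(speeds, avail)
    print("selected resources:", [i for i, bit in enumerate(mask) if bit])
    print("estimated completion time:", round(t, 3))
```

Under this assumed cost model, weighting speed by availability steers the search away from fast but rarely available resources, which reflects the abstract's point that availability is a primary factor in performance. A genetic algorithm is a heuristic, so the sketch returns a good subset rather than a provably optimal one.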