A resource manager for optimal resource selection and fault tolerance service in Grids

Authors:
P. Z. Kolano
Affiliations:
Dept. of Comput. Sci. Educ., Korea Univ., Seoul, South Korea
Venue:
CCGRID '04 Proceedings of the 2004 IEEE International Symposium on Cluster Computing and the Grid
Year:
2004

Citing: 0
Cited by: 6

Flexible Grid service management through resource partitioning
The Journal of Supercomputing

Pro-active failure handling mechanisms for scheduling in grid computing environments
Journal of Parallel and Distributed Computing

Load balancing in the presence of random node failure and recovery
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing

Research on independent and dynamic fault-tolerant and migration technology for cloud simulation resources
Proceedings of the 2011 Grand Challenges on Modeling and Simulation Conference

Probabilistic resource allocation in heterogeneous distributed systems with random failures
Journal of Parallel and Distributed Computing

A resource discovery and allocation mechanism in large computational grids for media applications
ISPA'07 Proceedings of the 2007 international conference on Frontiers of High Performance Computing and Networking

Abstract

In this paper, we address the issues of resource management and fault tolerance in Grids. In Grids, the state of the resources selected for job execution is a primary factor in determining computing performance. Specifically, we propose a resource manager for optimal resource selection. The resource manager automatically selects the optimal resources from among the candidate resources using a genetic algorithm. The probability of failure is typically higher in Grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution. A fault tolerance service is therefore essential in computational Grids, and Grid services are often expected to meet minimum levels of quality of service (QoS) for desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures, such as process failure, processor failure, and network failure, and design a fault detector and a fault manager. The simulation results indicate that our approaches are promising in that (1) our resource manager finds the optimal set of resources that guarantees optimal performance; (2) the fault detector detects the occurrence of resource failures; and (3) the fault manager guarantees that submitted jobs complete and, through job migration, improves job execution performance even if some failures occur.
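
The abstract states that resources are selected with a genetic algorithm but gives no details of the encoding, fitness function, or operators. The following is only a minimal illustrative sketch under assumptions of our own, not the authors' method: each candidate resource is assumed to have a single numeric score (higher is better), a solution is a set of k resource indices, and fitness is the sum of the selected scores. The function name `select_resources` and all of its parameters are hypothetical.

```python
import random


def select_resources(scores, k, generations=50, pop_size=30,
                     mutation_rate=0.1, seed=0):
    """Pick k of len(scores) candidates with a simple genetic algorithm.

    Assumption (not from the paper): fitness is the sum of per-resource scores.
    """
    rng = random.Random(seed)
    n = len(scores)

    def fitness(chosen):
        return sum(scores[i] for i in chosen)

    def crossover(a, b):
        # The child draws k distinct resources from the union of both parents.
        return frozenset(rng.sample(sorted(a | b), k))

    def mutate(chosen):
        # With a small probability, swap one selected resource for an unselected one.
        chosen = set(chosen)
        outside = [i for i in range(n) if i not in chosen]
        if outside and rng.random() < mutation_rate:
            chosen.remove(rng.choice(sorted(chosen)))
            chosen.add(rng.choice(outside))
        return frozenset(chosen)

    # Random initial population of k-element selections.
    population = [frozenset(rng.sample(range(n), k)) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # keep the fitter half unchanged
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite)))
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    return sorted(max(population, key=fitness))


if __name__ == "__main__":
    # Hypothetical scores for six candidate resources.
    candidate_scores = [0.2, 0.9, 0.5, 0.7, 0.1, 0.8]
    print(select_resources(candidate_scores, k=3))
```

A real Grid resource manager would presumably use a richer fitness function (for example, one combining CPU load, bandwidth, and failure probability); the additive score here is chosen only to keep the sketch short. A genetic algorithm is a heuristic, so this sketch does not by itself guarantee that the returned set is optimal.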