In grid computing, resource management and fault tolerance are central concerns. The availability of the resources selected for job execution is a primary factor in computing performance. In this paper, we propose a resource manager that uses a genetic algorithm to automatically select, from the candidate resources, the set that achieves optimal performance. The probability of failure is typically higher in grid computing than in traditional parallel computing, and a resource failure can be fatal to job execution, so a fault tolerance service is essential in computational grids. Grid services are also often expected to meet minimum levels of Quality of Service (QoS). To address this, we also propose a fault tolerance service that satisfies QoS requirements. We extend the conventional notion of failure in distributed systems so that the service can handle several types of resource failure, including process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures, and (3) the fault manager guarantees that submitted jobs complete and, through job migration, improves job execution performance even when some failures occur.
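To make the idea of genetic-algorithm resource selection concrete, the following is a minimal sketch and not the resource manager described in this paper. Everything in it is an assumption made for illustration: the candidate table, the fitness function (speed weighted by availability), the operators, and all parameter values. The abstract does not specify any of these.

```python
import random

# Hypothetical candidate resources: (name, relative speed, availability in [0, 1]).
CANDIDATES = [
    ("r0", 1.0, 0.99),
    ("r1", 2.0, 0.60),
    ("r2", 1.5, 0.95),
    ("r3", 3.0, 0.40),
    ("r4", 2.5, 0.90),
    ("r5", 0.8, 0.999),
]


def fitness(subset):
    """Assumed objective: sum of speed * availability over the chosen resources."""
    return sum(CANDIDATES[i][1] * CANDIDATES[i][2] for i in subset)


def random_subset(k, rng):
    """Pick k distinct candidate indices, stored as a sorted tuple."""
    return tuple(sorted(rng.sample(range(len(CANDIDATES)), k)))


def crossover(a, b, k, rng):
    """Child draws k distinct resources from the union of both parents."""
    pool = sorted(set(a) | set(b))
    return tuple(sorted(rng.sample(pool, k)))


def mutate(subset, rate, rng):
    """With probability `rate`, swap one selected resource for an unselected one."""
    if rng.random() >= rate:
        return subset
    unused = [i for i in range(len(CANDIDATES)) if i not in subset]
    if not unused:
        return subset
    chosen = list(subset)
    chosen[rng.randrange(len(chosen))] = rng.choice(unused)
    return tuple(sorted(chosen))


def select_resources(k, generations=50, pop_size=20, mutation_rate=0.2, seed=0):
    """Evolve subsets of size k and return (resource names, fitness) of the best."""
    rng = random.Random(seed)
    population = [random_subset(k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]  # keep the fitter half unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            children.append(mutate(crossover(a, b, k, rng), mutation_rate, rng))
        population = elite + children
    best = max(population, key=fitness)
    return [CANDIDATES[i][0] for i in best], fitness(best)


if __name__ == "__main__":
    names, score = select_resources(k=3)
    print(names, round(score, 3))
```

Because the fitter half of each generation is carried over unchanged, the best score never decreases between generations. A real grid resource manager would replace the toy fitness function with a performance model built from measured resource data.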