Fault-tolerant grid resource management infrastructure

Authors:
J. H. Abawajy; S. P. Dandamudi
Affiliations:
Carleton University, School of Computer Science, Ottawa, Ontario, Canada; Carleton University, School of Computer Science, Ottawa, Ontario, Canada
Venue:
Neural, Parallel & Scientific Computations - Special issue: Grid computing
Year:
2004

Abstract

The main motivation for existing Grid systems is to provide mechanisms for sharing and accessing large, heterogeneous collections of remote resources, and this remains their primary goal. However, achieving seamless large-scale distributed computing on a Grid raises not only the problems of efficient resource utilization and satisfactory response time but also the problem of fault tolerance. As Grid computing gains momentum, the ability to tolerate failures while exploiting Grid resources effectively, scalably, and transparently must become an integral part of the Grid infrastructure. In this paper, we present a reconfigurable multi-layered Grid infrastructure that provides fault-tolerance mechanisms to ensure that a Grid client can obtain reliable service even when the middleware services providing it suffer crash failures.