The main motivation for existing Grid systems is to provide mechanisms for sharing and accessing large, heterogeneous collections of remote resources, and this remains their primary goal. However, achieving seamless large-scale distributed computing on the Grid raises not only the problems of efficient utilization and satisfactory response time but also the problem of fault tolerance. As Grid computing gains momentum, the ability to tolerate failures while exploiting Grid resources effectively, scalably, and transparently must become an integral part of the Grid computing infrastructure. In this paper, we present a reconfigurable multi-layered Grid infrastructure that provides fault-tolerance mechanisms to ensure that a Grid client can obtain reliable services even if the middleware service providing them suffers crash failures.