Reliability, in terms of Grid component fault tolerance and minimum quality of service, is an important aspect that must be addressed to foster Grid technology adoption. Software reliability is critically important in today's integrated and distributed systems, as it is often the weak link in system performance. In general, reliability is difficult to measure, especially in Grid environments, where evaluation methodologies are still novel and controversial. This paper describes a straightforward procedure to analyze the reliability of computational grids from the viewpoint of an end user. The procedure is illustrated by evaluating a research Grid infrastructure based on Globus basic services and the GridWay meta-scheduler. GridWay's support for fault tolerance is also demonstrated in a production-level environment. Results show that GridWay is a reliable workload management tool for dynamic and faulty Grid environments. Transparently to the end user, GridWay is able to detect and recover from any of the Grid element failure, outage, and saturation conditions specified by the reliability analysis procedure.