A fault avoidance strategy improving the reliability of the EGI production grid infrastructure

Authors:
Francesco Palmieri;Silvio Pardi;Paolo Veronesi
Affiliations:
Università degli studi di Napoli Federico II, Napoli, Italy;INFN Sezione di Napoli and INDAM, Napoli, Italy;INFN CNAF, Bologna, Italy
Venue:
OPODIS'10 Proceedings of the 14th international conference on Principles of distributed systems
Year:
2010

Abstract

Reliability is a crucial issue for the development of stable and effective production grid infrastructures: grid users must be able to rely on the runtime services they request and receive from the underlying grid. Many runtime services and capabilities offered by modern grid infrastructures are not known in advance to application developers and are bound dynamically only at execution time, leading to an increased incidence of interaction faults. In this work we propose, implement and evaluate a novel low-impact fault-avoidance scheme, specifically conceived to improve grid reliability from the user/application point of view by providing proper service status information to the workload management system. In particular, starting from the EGEE experience, we designed a strategy that inhibits the use of specific runtime capabilities on the available resources as soon as the monitoring system detects any anomalous behavior associated with those capabilities, and re-integrates them once they resume working correctly. The results of a significant set of tests run on the production EGEE infrastructure are presented to show the effectiveness of our approach.
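
The inhibit/re-integrate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CapabilityRegistry`, its methods, the site names and the capability tags (`mpi`, `java`) are all hypothetical, and the sketch assumes that a single failed probe inhibits a capability and a single successful probe restores it, which the abstract does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityRegistry:
    """Hypothetical status table consulted by a workload manager."""
    # resource name -> set of runtime capabilities the resource advertises
    advertised: dict[str, set[str]]
    # (resource, capability) pairs currently inhibited by monitoring
    inhibited: set[tuple[str, str]] = field(default_factory=set)

    def report(self, resource: str, capability: str, ok: bool) -> None:
        """Record one monitoring probe result (assumed one-shot threshold)."""
        key = (resource, capability)
        if ok:
            self.inhibited.discard(key)  # re-integrate once the probe passes
        else:
            self.inhibited.add(key)      # inhibit as soon as an anomaly is seen

    def eligible(self, required: set[str]) -> list[str]:
        """Resources advertising every required capability, none inhibited."""
        return sorted(
            r for r, caps in self.advertised.items()
            if required <= caps
            and not any((r, c) in self.inhibited for c in required)
        )


# Hypothetical sites and capability tags, for illustration only.
registry = CapabilityRegistry({
    "site-a": {"mpi", "java"},
    "site-b": {"mpi"},
})

registry.report("site-a", "mpi", ok=False)   # monitoring detects an anomaly
after_fault = registry.eligible({"mpi"})
java_during_fault = registry.eligible({"java"})

registry.report("site-a", "mpi", ok=True)    # capability works again
after_recovery = registry.eligible({"mpi"})

print(after_fault, java_during_fault, after_recovery)
```

In this sketch only the faulty capability is withdrawn: while `mpi` is inhibited on `site-a`, jobs that need only `java` can still be matched to that site, which mirrors the abstract's point that specific capabilities, rather than whole resources, are excluded.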