Self-Healing of Operational Workflow Incidents on Distributed Computing Infrastructures

Authors:
Rafael Ferreira da Silva;Tristan Glatard;Frederic Desprez
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012


Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Incidents are classified into levels and associated with sets of healing actions that are selected based on association rules modeling correlations between incident levels. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Implementation and experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4 and properly detects unrecoverable errors.
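
The abstract mentions association rules that model correlations between incident levels. As a rough illustration of that idea only, the Python sketch below computes the confidence of single-item rules from a toy history of co-occurring incident levels and lists the levels that a detected incident tends to imply. The incident names, the history data, the threshold, and the function names are all hypothetical; none of them is taken from the paper, and the paper's actual selection procedure may differ.

```python
from collections import Counter
from itertools import permutations

# Hypothetical history: each entry is the set of incident levels seen together.
# Names and data are invented for illustration, not taken from the paper.
HISTORY = [
    {"long_tail:high", "site_blacklist:high"},
    {"long_tail:high", "site_blacklist:high"},
    {"long_tail:high"},
    {"input_data:high", "site_blacklist:high"},
]


def rule_confidences(history):
    """Return {(antecedent, consequent): confidence} for single-item rules.

    confidence(a -> b) = count(a and b together) / count(a)
    """
    item_counts = Counter()
    pair_counts = Counter()
    for observed in history:
        item_counts.update(observed)
        # Ordered pairs, so that a -> b and b -> a are counted separately.
        pair_counts.update(permutations(sorted(observed), 2))
    return {
        (a, b): pair_counts[(a, b)] / item_counts[a]
        for (a, b) in pair_counts
    }


def implied_levels(detected, rules, min_confidence=0.6):
    """List (level, confidence) pairs correlated with a detected incident level.

    The threshold is an assumed parameter; results are sorted by decreasing
    confidence, then by name for a stable order.
    """
    matches = [
        (consequent, conf)
        for (antecedent, consequent), conf in rules.items()
        if antecedent == detected and conf >= min_confidence
    ]
    return sorted(matches, key=lambda m: (-m[1], m[0]))


if __name__ == "__main__":
    rules = rule_confidences(HISTORY)
    print(implied_levels("input_data:high", rules))
```

In this toy history, every occurrence of the input-data incident coincides with a site incident, so the rule from the former to the latter has confidence 1.0, while the reverse rule falls below the assumed threshold.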