Self-healing of workflow activity incidents on distributed computing infrastructures

Authors:
Rafael Ferreira Da Silva;Tristan Glatard;Frédéric Desprez
Venue:
Future Generation Computer Systems
Year:
2013

Abstract

Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected based on association rules modeling correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.
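
As a rough illustration of the degree-then-level idea described in the abstract, the following Python sketch computes a long-tail degree for one activity, maps it to a level, and picks a healing action. The formula, the thresholds, and the action names are all assumptions made for this example and are not taken from the paper. The abstract states that the paper sets its parameters from production traces and selects actions through association rules between incident levels, neither of which is modeled here.

```python
from bisect import bisect_right
from statistics import median

# Hypothetical thresholds separating incident levels (assumption, not from
# the paper, which parametrizes its process on production traces).
LEVEL_THRESHOLDS = (0.5, 0.8)

# Hypothetical mapping from incident level to healing action (assumption).
ACTIONS = {0: "no action", 1: "replicate lagging tasks", 2: "alert operator"}


def long_tail_degree(completed, running):
    """Return an assumed long-tail degree in [0, 1).

    completed: durations (seconds) of tasks that already finished.
    running:   elapsed times (seconds) of tasks still running.
    The degree grows as the slowest running task lags behind the median
    duration of completed tasks.
    """
    if not completed or not running:
        return 0.0
    typical = median(completed)
    slowest = max(running)
    if slowest + typical == 0:
        return 0.0
    return slowest / (slowest + typical)


def incident_level(degree, thresholds=LEVEL_THRESHOLDS):
    """Map a degree to a discrete level by counting thresholds at or below it."""
    return bisect_right(thresholds, degree)


def healing_action(completed, running):
    """Select the (hypothetical) healing action for one workflow activity."""
    return ACTIONS[incident_level(long_tail_degree(completed, running))]


if __name__ == "__main__":
    finished = [100, 110, 90]
    for elapsed in (50, 300, 900):
        print(elapsed, healing_action(finished, [elapsed]))
```

Both functions only need running counters of task durations, which is in keeping with the abstract's point that the metrics are simple enough to be computed online.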