Distributed computing infrastructures are commonly used through scientific gateways, but operating these gateways requires substantial human intervention to handle operational incidents. This paper presents a self-healing process that quantifies incident degrees of workflow activities from metrics measuring the long-tail effect, application efficiency, data transfer issues, and site-specific problems. These metrics are simple enough to be computed online, and they make few assumptions about the application or resource characteristics. Based on their degree, incidents are classified into levels and associated with sets of healing actions, which are selected using association rules that model correlations between incident levels. We specifically study the long-tail effect and propose a new algorithm to control task replication. The healing process is parametrized on real application traces acquired in production on the European Grid Infrastructure. Experimental results obtained in the Virtual Imaging Platform show that the proposed method speeds up execution by up to a factor of 4, consumes up to 26% less resource time than a control execution, and properly detects unrecoverable errors.
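The classification and selection steps described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the incident names, thresholds, rules, confidence values, and action names are all hypothetical, and the way rules are used here (a detected incident level pulls in the action of a correlated incident) is only one plausible reading of "association rules modeling correlations between incident levels".

```python
from bisect import bisect_right

# Hypothetical thresholds splitting an incident degree in [0, 1] into levels.
# Level 1 is taken to mean "no healing needed" (an assumption of this sketch).
THRESHOLDS = {
    "long_tail": [0.3, 0.7],   # levels 1, 2, 3
    "data_transfer": [0.5],    # levels 1, 2
}

# Hypothetical association rules:
# ((incident, level), (correlated incident, level), confidence)
RULES = [
    (("long_tail", 2), ("data_transfer", 2), 0.8),
    (("data_transfer", 2), ("long_tail", 3), 0.2),
]

# Hypothetical healing action attached to each incident type.
ACTIONS = {
    "long_tail": "replicate_tasks",
    "data_transfer": "replicate_files",
}


def incident_level(incident, degree):
    """Map a degree to a level by counting the thresholds it reaches."""
    return bisect_right(THRESHOLDS[incident], degree) + 1


def select_actions(degrees, min_confidence=0.5):
    """Return the sorted healing actions for the observed incident degrees."""
    levels = {name: incident_level(name, d) for name, d in degrees.items()}
    # Incidents whose level is above the "no healing needed" level.
    active = {(name, lvl) for name, lvl in levels.items() if lvl > 1}
    candidates = {name for name, _ in active}
    # Add incidents that a sufficiently confident rule correlates with an active one.
    for antecedent, consequent, confidence in RULES:
        if antecedent in active and confidence >= min_confidence:
            candidates.add(consequent[0])
    return sorted(ACTIONS[name] for name in candidates)


if __name__ == "__main__":
    print(select_actions({"long_tail": 0.5, "data_transfer": 0.1}))
```

In this sketch a moderate long-tail degree alone triggers both actions, because the first rule links it to data transfer issues with a confidence above the cutoff; raising `min_confidence` above 0.8 would leave only the task replication action.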