Failure prediction and localization in large scientific workflows

Authors:
Taghrid Samak; Dan Gunter; Monte Goode; Ewa Deelman; Gaurang Mehta; Fabio Silva; Karan Vahi
Affiliations:
Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA; USC Information Sciences Institute, Marina Del Rey, CA, USA; USC Information Sciences Institute, Marina Del Rey, CA, USA; USC Information Sciences Institute, Marina Del Rey, CA, USA; USC Information Sciences Institute, Marina Del Rey, CA, USA
Venue:
Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science
Year:
2011

Abstract

Scientific workflows provide a portable representation of scientific applications: they coordinate input, output, and execution management for highly parallel runs of interdependent computations, and they support sharing and validating results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding the remediation of stochastic errors by predicting their impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failure. We evaluate our methods on a representative sample of large real-world workflows.