Online Fault and Anomaly Detection for Large-Scale Scientific Workflows

Authors:
Taghrid Samak;Dan Gunter;Monte Goode;Ewa Deelman;Gideon Juve;Gaurang Mehta;Fabio Silva;Karan Vahi
Venue:
HPCC '11 Proceedings of the 2011 IEEE International Conference on High Performance Computing and Communications
Year:
2011

Citing: 0
Cited by: 5

Failure prediction and localization in large scientific workflows. Proceedings of the 6th workshop on Workflows in support of large-scale science.
Online workflow management and performance analysis with stampede. Proceedings of the 7th International Conference on Network and Services Management.
Failure analysis of distributed scientific workflows executing in the cloud. Proceedings of the 8th International Conference on Network and Service Management.
Bundle and Pool Architecture for Multi-Language, Robust, Scalable Workflow Executions. Journal of Grid Computing.
A Case Study into Using Common Real-Time Workflow Monitoring Infrastructure for Scientific Workflows. Journal of Grid Computing.

Abstract

Scientific workflows enable complex scientific analyses. Large-scale scientific workflows run on complex parallel and distributed resources, where failures can arise at many points. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and troubleshoot problems without logging into remote nodes or searching through thousands of log files. As part of the NSF-funded Synthesized Tools for Archiving, Monitoring Performance and Enhanced DEbugging (STAMPEDE) project, we have developed an infrastructure that addresses these needs by integrating detailed workflow and resource monitoring. On top of this infrastructure, we have developed analysis techniques for online detection of a wide variety of "hard" and "soft" failures. We use the detected failures to derive higher-level statistics about the status of the resources and of the workflow as a whole. In this paper, we describe our techniques and evaluate their effectiveness on real application logs.