Failure prediction and localization in large scientific workflows

  • Authors: Taghrid Samak, Dan Gunter, Monte Goode (Lawrence Berkeley National Laboratory, Berkeley, CA, USA); Ewa Deelman, Gaurang Mehta, Fabio Silva, Karan Vahi (USC Information Sciences Institute, Marina Del Rey, CA, USA)

  • Venue: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science
  • Year: 2011

Abstract

Scientific workflows provide a portable representation of scientific applications: they coordinate input, output, and execution management for highly parallel runs of interdependent computations, and they support sharing and validating the results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding the remediation of stochastic errors through predictions of their impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failure. We evaluate our methods on a representative sample of large real-world workflows.