Online workflow management and performance analysis with Stampede

  • Authors and affiliations:
  • Dan Gunter (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Ewa Deelman (University of Southern California, Marina Del Rey, CA)
  • Taghrid Samak (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Christopher H. Brooks (University of San Francisco, San Francisco, CA)
  • Monte Goode (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Gideon Juve (University of Southern California, Marina Del Rey, CA)
  • Gaurang Mehta (University of Southern California, Marina Del Rey, CA)
  • Priscilla Moraes (University of Delaware, Newark, DE)
  • Fabio Silva (University of Southern California, Marina Del Rey, CA)
  • Martin Swany (University of Delaware, Newark, DE)
  • Karan Vahi (University of Southern California, Marina Del Rey, CA)

  • Venue:
  • Proceedings of the 7th International Conference on Network and Services Management
  • Year:
  • 2011

Abstract

Scientific workflows are an enabler of complex scientific analyses. They provide both a portable representation and a foundation upon which results can be validated and shared. Large-scale scientific workflows are executed on equally complex parallel and distributed resources, where many components can fail. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and perform troubleshooting, all without logging into remote nodes or searching through thousands of log files. As part of the NSF Stampede project, we have developed an infrastructure to meet these needs. The infrastructure captures application-level logs and resource information, normalizes them to standard representations, and stores them in a centralized, general-purpose schema. Higher-level tools mine the logs in real time to determine current status, predict failures, and detect anomalous performance.
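
To make the capture-normalize-store pipeline described above concrete, here is a minimal sketch of the normalization step in Python. It assumes a NetLogger-style name=value log line format; the field names, the sample line, and the normalize helper are illustrative assumptions for this sketch, not an API taken from the paper.

    import re
    from datetime import datetime, timezone

    # Matches name=value pairs, where the value may be quoted or a bare token.
    # Dotted names (e.g. workflow.job.start) are allowed in both positions.
    PAIR_RE = re.compile(r'(\w[\w.]*)=("[^"]*"|\S+)')

    def normalize(line: str) -> dict:
        """Parse one name=value log line into a flat event dict with a
        normalized UTC timestamp, suitable for loading into a
        general-purpose relational schema."""
        event = {}
        for key, value in PAIR_RE.findall(line):
            event[key] = value.strip('"')
        if "ts" in event:
            # Normalize ISO 8601 timestamps to a single UTC representation.
            event["ts"] = (
                datetime.fromisoformat(event["ts"].replace("Z", "+00:00"))
                .astimezone(timezone.utc)
                .isoformat()
            )
        return event

    if __name__ == "__main__":
        # Hypothetical log line; field names are illustrative only.
        sample = ('ts=2011-06-01T12:00:00Z event=workflow.job.start '
                  'job_id=42 host=node17 status=0')
        print(normalize(sample))

Once every log source is reduced to flat events like this, the downstream steps the abstract mentions (status queries, failure prediction, anomaly detection) can operate on one schema instead of per-application log formats.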