Online workflow management and performance analysis with Stampede

  • Authors and affiliations:
  • Dan Gunter (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Ewa Deelman (University of Southern California, Marina Del Rey, CA)
  • Taghrid Samak (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Christopher H. Brooks (University of San Francisco, San Francisco, CA)
  • Monte Goode (Lawrence Berkeley National Laboratory, Berkeley, CA)
  • Gideon Juve (University of Southern California, Marina Del Rey, CA)
  • Gaurang Mehta (University of Southern California, Marina Del Rey, CA)
  • Priscilla Moraes (University of Delaware, Newark, DE)
  • Fabio Silva (University of Southern California, Marina Del Rey, CA)
  • Martin Swany (University of Delaware, Newark, DE)
  • Karan Vahi (University of Southern California, Marina Del Rey, CA)

  • Venue:
  • Proceedings of the 7th International Conference on Network and Services Management
  • Year:
  • 2011

Abstract

Scientific workflows are an enabler of complex scientific analyses. They provide both a portable representation and a foundation upon which results can be validated and shared. Large-scale scientific workflows are executed on equally complex parallel and distributed resources, where many components can fail. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and perform troubleshooting, all without logging into remote nodes or searching through thousands of log files. As part of the NSF Stampede project, we have developed an infrastructure to meet these needs. The infrastructure captures application-level logs and resource information, normalizes them to standard representations, and stores them in a centralized, general-purpose schema. Higher-level tools mine the logs in real time to determine current status, predict failures, and detect anomalous performance.
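
To make the capture-normalize-store pipeline described above concrete, here is a minimal sketch of the normalization step in Python. It assumes a NetLogger-style name=value log line format; the field names, the sample line, and the normalize helper are illustrative assumptions for this sketch, not an API taken from the paper.

    import re
    from datetime import datetime, timezone

    # Matches name=value pairs, where the value may be quoted or a bare token.
    # Dotted names (e.g. workflow.job.start) are allowed in both positions.
    PAIR_RE = re.compile(r'(\w[\w.]*)=("[^"]*"|\S+)')

    def normalize(line: str) -> dict:
        """Parse one name=value log line into a flat event dict with a
        normalized UTC timestamp, suitable for loading into a
        general-purpose relational schema."""
        event = {}
        for key, value in PAIR_RE.findall(line):
            event[key] = value.strip('"')
        if "ts" in event:
            # Normalize ISO 8601 timestamps to a single UTC representation.
            event["ts"] = (
                datetime.fromisoformat(event["ts"].replace("Z", "+00:00"))
                .astimezone(timezone.utc)
                .isoformat()
            )
        return event

    if __name__ == "__main__":
        # Hypothetical log line; field names are illustrative only.
        sample = ('ts=2011-06-01T12:00:00Z event=workflow.job.start '
                  'job_id=42 host=node17 status=0')
        print(normalize(sample))

Once every log source is reduced to flat events like this, the downstream steps the abstract mentions (status queries, failure prediction, anomaly detection) can operate on one schema instead of per-application log formats.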