Scientific workflow design with data assembly lines

Authors:
Daniel Zinn; Shawn Bowers; Timothy McPhillips; Bertram Ludäscher
Affiliations:
University of California, Davis; Gonzaga University; UC Davis Genome Center; University of California, Davis and UC Davis Genome Center
Venue:
Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science
Year:
2009

Abstract

Despite increasing interest in scientific workflow technologies in recent years, workflow design remains a challenging, slow, and often error-prone process, which limits the speed of further adoption of scientific workflows. Based on practical experience with data-driven workflows, we identify and illustrate a number of recurring scientific workflow design challenges, namely parameter-rich functions; data assembly, disassembly, and cohesion; conditional execution; iteration; and, more generally, workflow evolution. In conventional approaches, such challenges usually lead to the introduction of different types of "shims", i.e., intermediary workflow steps that act as adapters between components that could not otherwise be wired together correctly. However, relying heavily on shims leads to brittle (i.e., change-intolerant) workflow designs that are hard to comprehend and maintain. To address these problems, we present a general workflow design paradigm called virtual data assembly lines (VDAL). In this paper, we show how the VDAL approach can overcome common scientific workflow design challenges and improve workflow designs by exploiting (i) a semistructured, nested data model such as XML, (ii) a flexible, statically analyzable configuration mechanism (e.g., an XQuery fragment), and (iii) an underlying virtual assembly line model that is resilient to workflow and data changes. The approach has been implemented as Kepler/COMAD and applied to improve the design of complex, real-world workflows.
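The assembly-line idea in the abstract can be illustrated with a minimal Python sketch. It is not the Kepler/COMAD API and not the authors' implementation: the names `make_actor` and `run_line`, the sample document, and the use of ElementTree path expressions in place of an XQuery fragment are all assumptions made for illustration. Each step declares which part of a nested XML structure it reads, adds its result next to the matched data, and passes everything else through untouched, so no adapter ("shim") steps are needed between components.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# Hypothetical actor type: takes the whole nested document, returns the whole document.
Actor = Callable[[ET.Element], ET.Element]


def make_actor(scope: str, out_tag: str, fn: Callable[[str], str]) -> Actor:
    """Build a step that reads elements matching `scope` (an ElementTree path,
    standing in for an XQuery fragment) and appends a result element to the
    same parent. Data outside the scope is left untouched."""

    def actor(root: ET.Element) -> ET.Element:
        # Map every child to its parent so results can be attached in place.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for elem in root.findall(scope):  # findall returns a list, so adding nodes is safe
            result = ET.SubElement(parent_of[elem], out_tag)
            result.text = fn(elem.text or "")
        return root

    return actor


def run_line(root: ET.Element, actors: List[Actor]) -> ET.Element:
    """Send the whole document past each step in order, like an assembly line."""
    for actor in actors:
        root = actor(root)
    return root


# Illustrative nested input: the <note> element is not known to any step.
doc = ET.fromstring(
    "<project>"
    "<sample id='s1'><seq>acgt</seq><note>keep me</note></sample>"
    "<sample id='s2'><seq>ttga</seq></sample>"
    "</project>"
)

line = [
    make_actor(".//seq", "upper", str.upper),                  # reads <seq>, adds <upper>
    make_actor(".//upper", "length", lambda s: str(len(s))),   # reads <upper>, adds <length>
]

out = run_line(doc, line)
print(ET.tostring(out, encoding="unicode"))
```

In this sketch, adding a new element such as `<note>` to the input, or inserting another step into `line`, requires no change to the existing steps, which is the kind of resilience to workflow and data changes that the abstract attributes to the assembly line model.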