An approach for pipelining nested collections in scientific workflows

  • Authors: Timothy M. McPhillips; Shawn Bowers
  • Affiliations: Natural Diversity Discovery Project; UC Davis Genome Center
  • Venue: ACM SIGMOD Record
  • Year: 2005

Abstract

We describe an approach for pipelining nested data collections in scientific workflows. Our approach logically delimits arbitrarily nested collections of data tokens using special, paired control tokens inserted into token streams, and provides workflow components with high-level operations for managing these collections. Our framework provides new capabilities for: (1) concurrent operation on collections; (2) on-the-fly customization of workflow component behavior; (3) improved handling of exceptions and faults; and (4) transparent passing of provenance and metadata within token streams. We demonstrate our approach using a workflow for inferring phylogenetic trees. We also describe future extensions to support richer typing mechanisms for facilitating sharing and reuse of workflow components between disciplines. This work represents a step towards our larger goal of exploiting collection-oriented dataflow programming as a new paradigm for scientific workflow systems, an approach we believe will significantly reduce the complexity of creating and reusing workflows and workflow components.
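The delimiting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names Open, Close, Collection, to_stream, apply_within, and from_stream are hypothetical, and the labels stand in for whatever metadata a real system would attach to a collection. The sketch assumes that a nested collection is flattened into a single token stream in which each collection is bracketed by a pair of control tokens, and that a component may act on the collections it understands while forwarding every other token unchanged.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Open:
    """Hypothetical control token marking the start of a collection."""
    label: str


@dataclass(frozen=True)
class Close:
    """Hypothetical control token marking the end of a collection."""
    label: str


@dataclass
class Collection:
    """In-memory view of a labeled, possibly nested collection."""
    label: str
    items: List[Any]


def to_stream(node: Any) -> Iterator[Any]:
    """Flatten a nested Collection into tokens with paired delimiters."""
    if isinstance(node, Collection):
        yield Open(node.label)
        for item in node.items:
            yield from to_stream(item)
        yield Close(node.label)
    else:
        yield node  # plain data token


def apply_within(tokens: Iterable[Any], label: str,
                 fn: Callable[[Any], Any]) -> Iterator[Any]:
    """Apply fn to data tokens enclosed by a collection with the given label.

    Control tokens and data outside such collections pass through untouched,
    so the component works lazily, one token at a time.
    """
    depth = 0  # number of currently open collections carrying the target label
    for tok in tokens:
        if isinstance(tok, Open):
            depth += tok.label == label
            yield tok
        elif isinstance(tok, Close):
            depth -= tok.label == label
            yield tok
        else:
            yield fn(tok) if depth > 0 else tok


def from_stream(tokens: Iterable[Any]) -> List[Any]:
    """Rebuild the nested structure, checking that delimiters pair up."""
    stack = [Collection("<root>", [])]
    for tok in tokens:
        if isinstance(tok, Open):
            stack.append(Collection(tok.label, []))
        elif isinstance(tok, Close):
            done = stack.pop()
            if not stack or done.label != tok.label:
                raise ValueError(f"unmatched closing token: {tok.label}")
            stack[-1].items.append(done)
        else:
            stack[-1].items.append(tok)
    if len(stack) != 1:
        raise ValueError("unclosed collection at end of stream")
    return stack[0].items


# Example: only the tokens inside the "seqs" collection are transformed.
project = Collection("project", ["readme", Collection("seqs", ["acgt", "ttga"])])
result = from_stream(apply_within(to_stream(project), "seqs", str.upper))
```

Because each stage is a generator, a downstream component can begin consuming tokens before an upstream one has finished emitting a collection, which is one way to read the concurrent operation on collections mentioned in the abstract. Exception handling and provenance passing are not modeled here.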