Scientific workflow design 2.0: Demonstrating streaming data collections in Kepler

  • Authors:
  • Lei Dou; Daniel Zinn; Timothy McPhillips; Sven Köhler; Sean Riddle; Shawn Bowers; Bertram Ludäscher

  • Affiliations:
  • UC Davis Genome Center, University of California, Davis, 95616, USA;UC Davis Genome Center, University of California, Davis, 95616, USA;UC Davis Genome Center, University of California, Davis, 95616, USA;UC Davis Genome Center, University of California, Davis, 95616, USA;UC Davis Genome Center, University of California, Davis, 95616, USA;Department of Computer Science, Gonzaga University, Spokane, WA 99258, USA;UC Davis Genome Center, University of California, Davis, 95616, USA

  • Venue:
  • ICDE '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering
  • Year:
  • 2011

Abstract

Scientific workflow systems are used to integrate existing software components (actors) into larger analysis pipelines to perform in silico experiments. Current approaches for handling data in nested-collection structures, as required in many scientific domains, lead to many record-management actors (shims) that make the workflow structure overly complex, and as a consequence hard to construct, evolve and maintain. By constructing and executing workflows from bioinformatics and geosciences in the Kepler system, we will demonstrate how COMAD (Collection-Oriented Modeling and Design), an extension of conventional workflow design, addresses these shortcomings. In particular, COMAD provides a hierarchical data stream model (as in XML) and a novel declarative configuration language for actors that functions as a middleware layer between the workflow's data model (streaming nested collections) and the actor's data model (base data and lists thereof). Our approach allows actor developers to focus on the internal actor processing logic oblivious to the workflow structure. Actors can then be re-used in various workflows simply by adapting actor configurations. Due to streaming nested collections and declarative configurations, COMAD workflows can usually be realized as linear data processing pipelines, which often reflect the scientific data analysis intention better than conventional designs. This linear structure not only simplifies actor insertions and deletions (workflow evolution), but also decreases the overall complexity of the workflow, reducing future effort in maintenance.
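
To make the idea of streaming nested collections more concrete, the following Python sketch models a nested collection as a flat token stream with open and close delimiters (as in XML) and wraps plain list-processing functions in a small configuration layer. It is an illustration only, not Kepler or COMAD code: the names `Open`, `Close`, `Data`, `scoped_actor`, and `pipeline`, as well as the `scope`, `read_label`, and `write_label` parameters, are hypothetical stand-ins for COMAD's declarative configuration language, whose actual syntax the abstract does not give.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Union


@dataclass(frozen=True)
class Open:
    """Delimiter marking the start of a collection (hypothetical token type)."""
    name: str


@dataclass(frozen=True)
class Close:
    """Delimiter marking the end of a collection (hypothetical token type)."""
    name: str


@dataclass(frozen=True)
class Data:
    """A labeled base data item inside a collection (hypothetical token type)."""
    label: str
    value: object


Token = Union[Open, Close, Data]
Actor = Callable[[Iterable[Token]], Iterator[Token]]


def scoped_actor(scope: str, read_label: str, write_label: str,
                 func: Callable[[List[object]], object]) -> Actor:
    """Wrap a plain list -> value function so it can run on a token stream.

    The wrapper plays the role of the middleware layer: it collects the
    values labeled `read_label` inside each `scope` collection, calls
    `func` on that list, and adds the result just before the collection
    closes. Every incoming token is forwarded unchanged.
    """
    def run(stream: Iterable[Token]) -> Iterator[Token]:
        buffers: List[List[object]] = []  # one buffer per open scope collection
        for tok in stream:
            if isinstance(tok, Open) and tok.name == scope:
                buffers.append([])
            elif isinstance(tok, Close) and tok.name == scope:
                yield Data(write_label, func(buffers.pop()))
            elif isinstance(tok, Data) and buffers and tok.label == read_label:
                buffers[-1].append(tok.value)
            yield tok  # pass-through keeps unrelated data in the stream
    return run


def pipeline(stream: Iterable[Token], *actors: Actor) -> Iterator[Token]:
    """Chain actors linearly; each one consumes the previous actor's output."""
    for actor in actors:
        stream = actor(stream)
    return iter(stream)


# The processing logic knows nothing about collections; only the
# configuration (scope and labels) ties it to the workflow structure.
sum_read_lengths = scoped_actor(
    "Sample", "Read", "TotalLength", lambda reads: sum(len(r) for r in reads))
grand_total = scoped_actor("Project", "TotalLength", "GrandTotal", sum)

example = [
    Open("Project"),
    Open("Sample"), Data("Read", "ACGT"), Data("Read", "GG"), Close("Sample"),
    Open("Sample"), Data("Read", "TTA"), Close("Sample"),
    Close("Project"),
]

for token in pipeline(example, sum_read_lengths, grand_total):
    print(token)
```

In this sketch both actors sit in a single linear chain, and each forwards tokens it does not use, so inserting or removing an actor does not require rewiring the rest of the pipeline. Reusing `sum` at the project level instead of the sample level only required a different configuration, which mirrors the reuse argument made in the abstract.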