A semantic framework for automatic generation of computational workflows using distributed data and component catalogues

  • Authors:
  • Yolanda Gil;Pedro A. Gonzalez-Calero;Jihie Kim;Joshua Moody;Varun Ratnakar

  • Affiliations:
  • Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA;Facultad de Informatica, Universidad Complutense de Madrid, 28040 Madrid, Spain;Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA;Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA;Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA

  • Venue:
  • Journal of Experimental & Theoretical Artificial Intelligence
  • Year:
  • 2011

Abstract

Computational workflows are a powerful paradigm to represent and manage complex applications, particularly in large-scale distributed scientific data analysis. Workflows represent application components that result in individual computations as well as their interdependences in terms of dataflow. Workflow systems use these representations to manage various aspects of workflow creation and execution for users, such as the automatic assignment of execution resources. This article describes an approach to automating a new aspect of the process: the selection of application components and data sources. We present a novel approach that enables users to specify varying degrees of detail and numbers of constraints in a workflow request, including the specification of constraints on input, intermediate or output data in the workflow, abstract workflow component classes rather than specific component implementations, and generic reusable workflow templates that express a pre-defined combination of components. The algorithm elaborates the user request into a set of fully ground workflows with specific choices of data sources and codes to be used so that they can be submitted for mapping and execution. The algorithm searches through the space of possible candidate workflows by creating increasingly specialized versions of the original template and eliminating candidates that violate constraints accumulated in the candidate workflow as components and data sources are selected. A novel feature of our approach is that it assumes a distributed architecture where data and component catalogues are separate from the workflow system. The algorithm explicitly poses queries to external catalogues, and therefore any reasoning regarding data or component properties is not assumed to occur within the workflow system. We describe our implementation of this approach in the Wings workflow system.
This implementation uses the W3C Web Ontology Language and associated reasoners to implement the workflow system as well as the data and component catalogues. This research demonstrates the use of artificial intelligence techniques to support the kinds of automation envisioned by the scientific community for large-scale distributed scientific data analysis.
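
The search described in the abstract can be illustrated with a minimal sketch. It is not the Wings implementation and does not use OWL reasoners: the class names (`Implementation`, `ComponentCatalogue`, `DataCatalogue`), the function `elaborate`, the single-type constraint model and the example catalogue entries are all assumptions made for illustration. It shows only the general shape of the idea: a template of abstract component classes is specialized step by step by querying a component catalogue, candidates that violate an accumulated dataflow constraint are discarded, and the survivors are bound to concrete data sources by querying a separate data catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    """A concrete component (hypothetical model: one input type, one output type)."""
    name: str
    input_type: str
    output_type: str


class ComponentCatalogue:
    """Stand-in for an external component catalogue (assumed interface)."""

    def __init__(self, implementations):
        # Maps an abstract component class to its concrete implementations.
        self._implementations = implementations

    def implementations_of(self, abstract_class):
        return self._implementations.get(abstract_class, [])


class DataCatalogue:
    """Stand-in for an external data catalogue (assumed interface)."""

    def __init__(self, datasets):
        # Maps a dataset name to its data type.
        self._datasets = datasets

    def find(self, data_type):
        return sorted(n for n, t in self._datasets.items() if t == data_type)


def elaborate(template, components, data):
    """Expand a linear template of abstract classes into ground workflows."""
    if not template:
        return []
    partials = [()]
    for abstract_class in template:
        extended = []
        for partial in partials:
            for impl in components.implementations_of(abstract_class):
                # Eliminate candidates whose dataflow types do not match.
                if partial and partial[-1].output_type != impl.input_type:
                    continue
                extended.append(partial + (impl,))
        partials = extended
    ground = []
    for partial in partials:
        # Bind the workflow input by querying the data catalogue.
        for source in data.find(partial[0].input_type):
            ground.append((source, tuple(impl.name for impl in partial)))
    return ground


# Hypothetical catalogue contents, invented for this example.
components = ComponentCatalogue({
    "Sampler": [Implementation("RandomSample", "table", "table")],
    "Modeler": [
        Implementation("DecisionTree", "table", "model"),
        Implementation("TextModeler", "text", "model"),
    ],
})
data = DataCatalogue({
    "weather.csv": "table",
    "notes.txt": "text",
    "sales.csv": "table",
})

workflows = elaborate(["Sampler", "Modeler"], components, data)
for source, steps in workflows:
    print(source, "->", " -> ".join(steps))
```

In this sketch the workflow code never inspects component or data properties beyond what the two catalogue objects return, which loosely mirrors the separation between the workflow system and external catalogues that the abstract describes. A real system would handle non-linear dataflow and richer constraints than a single type label.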