Data parallelism in bioinformatics workflows using Hydra

  • Authors:
  • Fábio Coutinho;Eduardo Ogasawara;Daniel de Oliveira;Vanessa Braganholo;Alexandre A. B. Lima;Alberto M. R. Dávila;Marta Mattoso

  • Affiliations:
  • Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Oswaldo Cruz Institute -- FIOCRUZ -- Rio de Janeiro Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil

  • Venue:
  • Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
  • Year:
  • 2010

Abstract

Large-scale bioinformatics experiments are usually composed of a set of data flows generated by a chain of activities (programs or services) that may be modeled as scientific workflows. Scientific Workflow Management Systems (SWfMS) orchestrate these workflows, controlling and monitoring the whole execution. Bioinformatics experiments very commonly process very large datasets, so data parallelism is a common approach to increase performance and reduce overall execution time. However, most current SWfMS still lack support for parallel execution in high performance computing (HPC) environments. Additionally, keeping track of provenance data in distributed environments is still an open, yet important, problem. Recently, the Hydra middleware was proposed to bridge the gap between the SWfMS and the HPC environment by providing a transparent way for scientists to parallelize workflow executions while capturing distributed provenance. This paper analyzes data parallelism scenarios in the bioinformatics domain and presents an extension to the Hydra middleware through a specific cartridge that promotes data parallelism in bioinformatics workflows. Experimental results using workflows with BLAST show performance gains with the additional benefit of distributed provenance support.
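
The data parallelism idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not Hydra's interface, which the abstract does not describe. It only shows the general pattern: fragment a multi-sequence FASTA input, run the same activity on each fragment concurrently, and merge the partial results in input order. All function names are hypothetical, and run_activity is a stand-in that reports sequence lengths; a real workflow would presumably invoke a program such as BLAST at that point.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) records."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fragment(records, n_fragments):
    """Split records into at most n_fragments contiguous, near-equal parts."""
    n = max(1, min(n_fragments, len(records)))
    size, extra = divmod(len(records), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts receive one additional record.
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def run_activity(part):
    """Hypothetical stand-in for the real activity (e.g. a BLAST call).

    Here it only returns the length of each sequence in the fragment.
    """
    return [(header, len(seq)) for header, seq in part]


def run_data_parallel(fasta_text, n_fragments=4):
    """Fragment the input, process fragments concurrently, merge in order."""
    parts = fragment(parse_fasta(fasta_text), n_fragments)
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        # pool.map yields results in the same order as the input fragments.
        partial_results = list(pool.map(run_activity, parts))
    return [row for rows in partial_results for row in rows]
```

Because each fragment is processed independently, the merged result does not depend on the number of fragments. Threads are used here only to keep the sketch self-contained; on an HPC cluster each fragment would more plausibly be dispatched to a separate node.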