Data parallelism in bioinformatics workflows using Hydra

Authors:
Fábio Coutinho;Eduardo Ogasawara;Daniel de Oliveira;Vanessa Braganholo;Alexandre A. B. Lima;Alberto M. R. Dávila;Marta Mattoso
Affiliations:
Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil;Oswaldo Cruz Institute -- FIOCRUZ -- Rio de Janeiro Brazil;Federal University of Rio de Janeiro - Rio de Janeiro -- Brazil
Venue:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Year:
2010

Abstract

Large-scale bioinformatics experiments are usually composed of a set of data flows generated by a chain of activities (programs or services) that may be modeled as scientific workflows. Scientific Workflow Management Systems (SWfMS) orchestrate these workflows, controlling and monitoring the whole execution. Bioinformatics experiments very commonly process very large datasets, so data parallelism is a common approach to increase performance and reduce overall execution time. However, most current SWfMS still lack support for parallel execution in high performance computing (HPC) environments. Additionally, keeping track of provenance data in distributed environments is still an open yet important problem. Recently, the Hydra middleware was proposed to bridge the gap between the SWfMS and the HPC environment by providing a transparent way for scientists to parallelize workflow executions while capturing distributed provenance. This paper analyzes data parallelism scenarios in the bioinformatics domain and presents an extension to the Hydra middleware through a specific cartridge that promotes data parallelism in bioinformatics workflows. Experimental results using workflows with BLAST show performance gains with the additional benefit of distributed provenance support.
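
The data parallelism pattern the abstract refers to can be illustrated with a minimal Python sketch. It is not Hydra's API or the paper's cartridge: every function name below is hypothetical, and the per-fragment activity is a placeholder that only measures sequence lengths, where a real workflow would invoke a program such as BLAST. The sketch assumes the input is FASTA text and that the records can be processed independently, so the dataset can be fragmented, the fragments processed concurrently, and the partial results merged.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (header, sequence) records from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fragment(records, n_fragments):
    """Distribute records round-robin into at most n_fragments non-empty fragments."""
    buckets = [records[i::n_fragments] for i in range(n_fragments)]
    return [b for b in buckets if b]


def analyze_fragment(frag):
    """Hypothetical stand-in for the per-fragment activity (e.g. a BLAST run).

    It only computes each sequence length so the sketch runs anywhere.
    """
    return [(header, len(seq)) for header, seq in frag]


def run_data_parallel(fasta_text, n_fragments=2):
    """Fragment the input, process the fragments concurrently, merge the results."""
    frags = fragment(parse_fasta(fasta_text), n_fragments)
    with ThreadPoolExecutor(max_workers=max(1, len(frags))) as pool:
        partials = list(pool.map(analyze_fragment, frags))
    # Merge step: concatenate partial results and sort for a stable order.
    return sorted(hit for part in partials for hit in part)


EXAMPLE = """>seq1
ACGT
ACG
>seq2
TT
>seq3
GGGCCC
"""

if __name__ == "__main__":
    for header, length in run_data_parallel(EXAMPLE, n_fragments=2):
        print(header, length)
```

Threads are used here only because a real per-fragment activity would typically launch an external program and wait for it. In an HPC setting, each fragment would instead be dispatched to a separate node by the middleware.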