Designing a parallel cloud based comparative genomics workflow to improve phylogenetic analyses

Authors:
Kary A. C. S. Ocaña;Daniel De Oliveira;Jonas Dias;Eduardo Ogasawara;Marta Mattoso
Venue:
Future Generation Computer Systems
Year:
2013


Abstract

In recent years, comparative genomics analyses have become more compute-intensive due to the explosive growth in the number of available genome sequences. Comparative genomics analysis is an important a priori step for experiments in various bioinformatics domains. This analysis can be used to enhance the performance and quality of experiments in areas such as evolution and phylogeny. A common phylogenetic analysis makes extensive use of Multiple Sequence Alignment (MSA) in the construction of phylogenetic trees, which are used to infer evolutionary relationships between homologous genes. Each phylogenetic analysis aims at exploring several different MSA methods to verify which execution produces trees with the best quality. This phylogenetic exploration may run for weeks, even when executed in High Performance Computing (HPC) environments. Although there are many approaches that model and parallelize phylogenetic analysis as scientific workflows, exploring all MSA methods becomes a complex and expensive task. If scientists could determine a priori the most adequate MSA method to use in the phylogenetic analysis, they would save time and, in some cases, financial resources. Comparative genomics analyses therefore play an important role in optimizing phylogenetic analysis workflows. In this paper, we extend the SciHmm scientific workflow, which is aimed at determining the most suitable MSA method, for use in a phylogenetic analysis. SciHmm uses SciCumulus, a cloud workflow execution engine, for parallel execution. Experimental results show that using SciHmm considerably reduces the total execution time of the phylogenetic analysis (by up to 80%). Experiments also show that trees built with the MSA program selected by SciHmm were of higher quality than those built with the remaining programs, as expected. In addition, the parallel execution of SciHmm shows that this kind of bioinformatics workflow has an excellent cost/benefit ratio when executed in cloud environments.