Performance evaluation of parallel strategies in public clouds: A study with phylogenomic workflows

Authors:
Daniel de Oliveira;Kary A. C. S. Ocaña;Eduardo Ogasawara;Jonas Dias;João Gonçalves;Fernanda Baião;Marta Mattoso
Affiliations:
PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil and CEFET/RJ - Federal Center of Technological Education, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;UNIRIO - Federal University of the State of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Venue:
Future Generation Computer Systems
Year:
2013

Abstract

Data analysis is an exploratory process that demands high performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that produces phylogenomic trees from an input set of protein sequences of genomes in order to infer evolutionary relationships among living organisms. SciPhylomics can benefit from the parallel processing techniques provided by existing approaches, such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability. In this paper, we present a performance evaluation of SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of data parallelism for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable for execution in the cloud despite its need for high performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, which offers a first indication that these cloud execution results can be extrapolated to other classes of experiments.