Performance evaluation of parallel strategies in public clouds: A study with phylogenomic workflows

  • Authors:
  • Daniel De Oliveira;Kary A. C. S. Ocaña;Eduardo Ogasawara;Jonas Dias;João Gonçalves;Fernanda Baião;Marta Mattoso

  • Affiliations:
  • PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil and CEFET/RJ - Federal Center of Technological Education, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;UNIRIO-Federal University of the State of Rio de Janeiro, Rio de Janeiro, Brazil;PESC/COPPE - Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

  • Venue:
  • Future Generation Computer Systems
  • Year:
  • 2013

Abstract

Data analysis is an exploratory process that demands high performance computing (HPC). SciPhylomics, for example, is a data-intensive workflow that produces phylogenomic trees from an input set of protein sequences of genomes in order to infer evolutionary relationships among living organisms. SciPhylomics can benefit from the parallel processing techniques provided by existing approaches such as the SciCumulus cloud workflow engine and MapReduce implementations such as Hadoop. Despite some performance fluctuations, computing clouds provide a new dimension for HPC due to their elasticity and availability. In this paper, we present a performance evaluation of SciPhylomics executions in a real cloud environment. The workflow was executed using two parallel execution approaches (SciCumulus and Hadoop) on the Amazon EC2 cloud. Our results reinforce the benefits of data parallelism for the phylogenomic inference workflow using MapReduce-like parallel approaches in the cloud. The performance results demonstrate that this class of bioinformatics experiment is suitable for execution in the cloud despite its need for high performance capabilities. The evaluated workflow shares many features with other data-intensive workflows, which gives a first indication that these cloud execution results can be extrapolated to other classes of experiments.