Cloud technologies for bioinformatics applications

Authors:
Xiaohong Qiu;Jaliya Ekanayake;Scott Beason;Thilina Gunarathne;Geoffrey Fox;Roger Barga;Dennis Gannon
Affiliations:
Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Indiana University, Bloomington, IN;Microsoft Research, Microsoft Corporation, Redmond, WA;Microsoft Research, Microsoft Corporation, Redmond, WA
Venue:
Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
Year:
2009

Abstract

Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. For one of the applications, we also compare against traditional MPI and Apache Hadoop MapReduce implementations. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and on an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar databases (EST, Alu) or two different databases (PhyloD). The structure of the final stages differs from application to application.
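
To make the "doubly data parallel step" concrete, the following is a minimal Python sketch, not the paper's Dryad or Azure implementation. It assumes the step amounts to scoring every pair of items drawn from two collections, with each pair forming an independent task. The names `mismatch_count` and `all_pairs` are hypothetical, and the toy scoring function stands in for a real alignment routine.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def mismatch_count(a: str, b: str) -> int:
    """Toy stand-in for a real pairwise alignment score (hypothetical)."""
    # Count differing positions over the shared prefix, plus the length gap.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def all_pairs(left, right, score=mismatch_count, workers=4):
    """Score every (i, j) pair drawn from two collections.

    Each pair is an independent task, so the pairs can be handed to
    workers with no inter-task communication.
    """
    pairs = list(product(range(len(left)), range(len(right))))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `pairs`.
        scores = list(
            pool.map(lambda ij: score(left[ij[0]], right[ij[1]]), pairs)
        )
    # Reassemble the flat result list into a len(left) x len(right) matrix.
    n = len(right)
    return [scores[r * n:(r + 1) * n] for r in range(len(left))]


# Same collection on both sides, as in the EST and Alu cases.
seqs = ["ACGT", "ACGA", "TCG"]
matrix = all_pairs(seqs, seqs)
```

Passing the same collection twice mirrors the EST and Alu cases, while passing two different collections mirrors the PhyloD case. The application-specific final stages mentioned in the abstract would consume `matrix` and are not sketched here.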