Executing a large number of independent tasks, or tasks that perform minimal inter-task communication, in parallel is a common requirement in many domains. In this paper, we present our experience in applying two new Microsoft technologies, Dryad and Azure, to three bioinformatics applications. We also compare them with traditional MPI and Apache Hadoop MapReduce implementations in one example. The applications are an EST (Expressed Sequence Tag) sequence assembly program, the PhyloD statistical package for identifying HLA-associated viral evolution, and a pairwise Alu gene alignment application. We give a detailed performance discussion on a 768-core Windows HPC Server cluster and an Azure cloud. All the applications start with a "doubly data parallel step" involving independent data chosen from two similar databases (EST, Alu) or two different databases (PhyloD). The final stages have a different structure in each application.
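
As a rough illustration of what a "doubly data parallel step" can look like, the sketch below scores every pair drawn from two collections, treating each pair as an independent task. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the names `all_pairs` and `mismatch_count` are hypothetical, the scoring function is a toy stand-in for a real alignment, and a local thread pool stands in for a Dryad, Hadoop, MPI, or Azure deployment.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def mismatch_count(a: str, b: str) -> int:
    """Toy stand-in for an alignment score (assumption, not a real aligner):
    number of differing positions plus the difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def all_pairs(left, right, score=mismatch_count, workers=4):
    """Score every (left[i], right[j]) pair.

    Each pair depends only on its two inputs, so the tasks need no
    inter-task communication and can run in any order.
    """
    index_pairs = list(product(range(len(left)), range(len(right))))
    result = [[0] * len(right) for _ in left]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so scores line up with index_pairs.
        scores = pool.map(
            lambda ij: score(left[ij[0]], right[ij[1]]), index_pairs
        )
        for (i, j), s in zip(index_pairs, scores):
            result[i][j] = s
    return result


# Same collection on both sides, as in the EST and Alu cases.
sequences = ["ACGT", "ACGA", "TCGA"]
matrix = all_pairs(sequences, sequences)
```

Passing two different collections as `left` and `right` corresponds to the case where the data come from two different databases. Any final stage, such as clustering the resulting matrix, would follow this step and is not shown here.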