Large-Scale DNA sequence analysis in the cloud: a stream-based approach

Authors:
Romeo Kienzler; Rémy Bruggmann; Anand Ranganathan; Nesime Tatbul
Affiliations:
Department of Computer Science, ETH Zurich, Switzerland; Bioinformatics, Department of Biology, University of Berne, Switzerland; IBM T.J. Watson Research Center, NY; Department of Computer Science, ETH Zurich, Switzerland
Venue:
Euro-Par'11 Proceedings of the 2011 international conference on Parallel Processing - Volume 2
Year:
2011

Abstract

Cloud computing technologies have made it possible to analyze big data sets in scalable and cost-effective ways. DNA sequence analysis, where Next-Generation Sequencing (NGS) methods now generate very large data sets at reduced cost, is an area that can greatly benefit from cloud-based infrastructures. Although existing solutions show nearly linear scalability, they are significantly limited by data transfer latencies and cloud storage costs. In this paper, we propose a streaming data management architecture to tackle the performance problems that arise from having to transfer large amounts of data between clients and the cloud. Our approach provides an incremental data processing model that can hide data transfer latencies while maintaining linear scalability. We present an initial implementation and evaluation of this approach for SHRiMP, a well-known software package for NGS read alignment, based on the IBM InfoSphere Streams computing platform deployed on Amazon EC2.
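
The idea of hiding transfer latency behind incremental processing can be illustrated with a minimal sketch. This is not the authors' implementation and uses neither SHRiMP nor InfoSphere Streams: the functions `stream_reads`, `toy_align`, `align_stream`, and `run_pipeline` are hypothetical names, and exact substring search stands in for real read alignment. The sketch only shows the general pattern, assuming reads arrive in chunks: a producer pushes chunks into a bounded queue while a consumer aligns each chunk as soon as it arrives, so alignment starts before the whole data set has been transferred.

```python
import queue
import threading


def stream_reads(reads, chunk_size, out_q):
    """Producer: stands in for a client uploading reads chunk by chunk."""
    for i in range(0, len(reads), chunk_size):
        out_q.put(reads[i:i + chunk_size])
    out_q.put(None)  # sentinel marking the end of the stream


def toy_align(read, reference):
    """Hypothetical stand-in for an aligner: exact substring search.

    Returns the 0-based match position, or -1 if the read does not occur.
    """
    return reference.find(read)


def align_stream(reference, in_q, results):
    """Consumer: aligns every chunk as soon as it is available."""
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        for read in chunk:
            results.append((read, toy_align(read, reference)))


def run_pipeline(reference, reads, chunk_size=2):
    """Run transfer (producer) and alignment (consumer) concurrently."""
    q = queue.Queue(maxsize=4)  # bounded buffer: producer blocks when full
    results = []
    producer = threading.Thread(target=stream_reads, args=(reads, chunk_size, q))
    consumer = threading.Thread(target=align_stream, args=(reference, q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


if __name__ == "__main__":
    print(run_pipeline("ACGTACGTTAGC", ["ACGT", "TAGC", "GGGG"]))
```

With a single consumer and a FIFO queue, results come out in the order the reads were sent. The bounded queue is a design choice in this sketch: it keeps the sender from running arbitrarily far ahead of the aligner, which limits how much data has to be buffered on the receiving side.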