Parallel short sequence mapping for high throughput genome sequencing

Authors:
Doruk Bozdag;Catalin C. Barbacioru;Umit V. Catalyurek
Affiliations:
The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA;Applied Biosystems, 850 Lincoln Center Drive, Foster City, CA 94404, USA;The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA
Venue:
IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
Year:
2009

Citing: 0
Cited by: 4

A Comprehensive Analysis Workflow for Genome-Wide Screening Data from ChIP-Sequencing Experiments
BICoB '09 Proceedings of the 1st International Conference on Bioinformatics and Computational Biology

A moldable online scheduling algorithm and its application to parallel short sequence mapping
JSSPP'10 Proceedings of the 15th international conference on Job scheduling strategies for parallel processing

Optimizing the stretch of independent tasks on a cluster: From sequential tasks to moldable tasks
Journal of Parallel and Distributed Computing

Masher: Mapping Long(er) Reads with Hash-based Genome Indexing on GPUs
Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics
Abstract

With the advent of next-generation high-throughput sequencing instruments, large volumes of short sequence data are generated at an unprecedented rate. Processing and analyzing this massive volume of data requires overcoming several challenges, including mapping the generated short sequences to a reference genome. On large-scale datasets, this computationally intensive process takes on the order of days with existing sequential techniques. In this work, we propose six parallelization methods to speed up short sequence mapping and reduce the execution time for such datasets to just a few hours. We compare these methods and give a theoretical cost model for each. Experimental results on real datasets demonstrate the effectiveness of the parallel methods and indicate that the cost models accurately estimate parallel execution time. Based on these cost models, we implemented a selection function to predict the best method for a given scenario. To the best of our knowledge, this is the first study on parallelization of the short sequence mapping problem.
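
The idea of a cost-model-driven selection function can be illustrated with a minimal sketch. The abstract does not give the six methods or their cost models, so everything below is an assumption made for illustration: the `Scenario` fields, the two placeholder methods (`split_reads`, `split_genome`), and the coefficients are hypothetical and are not the paper's models. The sketch only shows the general pattern of evaluating each candidate's predicted time and choosing the smallest.

```python
from typing import Callable, Dict, NamedTuple


class Scenario(NamedTuple):
    """Hypothetical description of one mapping job."""
    num_reads: int    # number of short sequences to map
    genome_size: int  # reference genome length in bases
    num_procs: int    # number of processors available


CostModel = Callable[[Scenario], float]

# Illustrative per-base and per-read costs; NOT taken from the paper.
GENOME_COST = 1e-6
READ_COST = 1e-4


def cost_split_reads(s: Scenario) -> float:
    # Assumed model: each processor handles the whole genome
    # but maps only its share of the reads.
    return GENOME_COST * s.genome_size + READ_COST * s.num_reads / s.num_procs


def cost_split_genome(s: Scenario) -> float:
    # Assumed model: each processor handles a slice of the genome
    # but must process every read against that slice.
    return GENOME_COST * s.genome_size / s.num_procs + READ_COST * s.num_reads


# Placeholder registry; the paper's six methods would each get an entry.
MODELS: Dict[str, CostModel] = {
    "split_reads": cost_split_reads,
    "split_genome": cost_split_genome,
}


def select_method(scenario: Scenario, models: Dict[str, CostModel] = MODELS) -> str:
    """Return the name of the method with the lowest predicted cost."""
    return min(models, key=lambda name: models[name](scenario))


if __name__ == "__main__":
    many_reads = Scenario(num_reads=10**8, genome_size=10**6, num_procs=16)
    large_genome = Scenario(num_reads=1000, genome_size=3 * 10**9, num_procs=16)
    print(select_method(many_reads))
    print(select_method(large_genome))
```

Under these assumed models, a job dominated by the number of reads favors distributing the reads, while a job dominated by genome size favors distributing the genome. A real selection function would substitute cost models calibrated against measured execution times.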