Parallel short sequence mapping for high throughput genome sequencing

  • Authors:
  • Doruk Bozdag; Catalin C. Barbacioru; Umit V. Catalyurek

  • Affiliations:
  • The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA; Applied Biosystems, 850 Lincoln Center Drive, Foster City, CA 94404, USA; The Ohio State University, Dept. of Biomedical Informatics, Columbus, 43210, USA

  • Venue:
  • IPDPS '09 Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing
  • Year:
  • 2009

Abstract

With the advent of next-generation high-throughput sequencing instruments, large volumes of short sequence data are being generated at an unprecedented rate. Processing and analyzing these massive datasets requires overcoming several challenges, including mapping the generated short sequences to a reference genome. On large-scale datasets, this computationally intensive process takes on the order of days with existing sequential techniques. In this work, we propose six parallelization methods to speed up short sequence mapping and reduce the execution time to just a few hours for such datasets. We compare these methods and give a theoretical cost model for each. Experimental results on real datasets demonstrate the effectiveness of the parallel methods and indicate that the cost models accurately estimate parallel execution time. Based on these cost models, we implemented a selection function that predicts the best method for a given scenario. To the best of our knowledge, this is the first study on the parallelization of the short sequence mapping problem.
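
The abstract does not give the cost models or the selection function themselves, so the following is only a minimal sketch of the general idea: evaluate one cost model per method and pick the method with the lowest predicted time. The two strategies shown (partitioning the reads versus partitioning the genome), the function names, and all constants are illustrative assumptions, not the paper's six methods or its measured parameters.

```python
from typing import Callable, Dict, Tuple

# A cost model maps (num_reads, genome_size, num_procs) to estimated seconds.
CostModel = Callable[[int, int, int], float]

# Placeholder per-unit costs (assumed values, not taken from the paper).
SECONDS_PER_GENOME_BASE = 1e-7   # cost of preprocessing one reference base
SECONDS_PER_READ = 1e-5          # cost of mapping one short read


def partition_reads_cost(num_reads: int, genome_size: int, num_procs: int) -> float:
    """Hypothetical model: every process handles the whole genome,
    while the reads are divided evenly among processes."""
    return (genome_size * SECONDS_PER_GENOME_BASE
            + (num_reads / num_procs) * SECONDS_PER_READ)


def partition_genome_cost(num_reads: int, genome_size: int, num_procs: int) -> float:
    """Hypothetical model: the genome is divided evenly among processes,
    while every process handles all of the reads."""
    return ((genome_size / num_procs) * SECONDS_PER_GENOME_BASE
            + num_reads * SECONDS_PER_READ)


# Registry of candidate methods (names are illustrative).
MODELS: Dict[str, CostModel] = {
    "partition_reads": partition_reads_cost,
    "partition_genome": partition_genome_cost,
}


def select_method(models: Dict[str, CostModel], num_reads: int,
                  genome_size: int, num_procs: int) -> Tuple[str, float]:
    """Return the name and predicted time of the cheapest method."""
    estimates = {name: model(num_reads, genome_size, num_procs)
                 for name, model in models.items()}
    best = min(estimates, key=estimates.get)
    return best, estimates[best]


if __name__ == "__main__":
    # Example scenario: 10 million reads, a 3 Gbase genome, 16 processes.
    name, seconds = select_method(MODELS, 10_000_000, 3_000_000_000, 16)
    print(f"predicted best method: {name} ({seconds:.2f} s estimated)")
```

Under these assumed constants, the choice flips with the workload: a large genome with relatively few reads favors dividing the genome, while a small genome with very many reads favors dividing the reads. That dependence on the scenario is the reason a model-driven selection step is useful.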