Characterizing deep sequencing analytics using BFAST: towards a scalable distributed architecture for next-generation sequencing data

  • Authors:
  • Joohyun Kim;Sharath Maddineni;Shantenu Jha

  • Affiliations:
  • Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA

  • Venue:
  • Proceedings of the second international workshop on Emerging computational methods for the life sciences
  • Year:
  • 2011

Abstract

Next-generation DNA sequencing (NGS) platforms produce significantly larger amounts of data than early Sanger sequencers. In addition to the data-management challenges that arise from these unprecedented volumes, the data must also be analyzed effectively. In this paper, we use BFAST, a genome-wide mapping application, as a representative example of the typical analysis required on data from NGS machines. We investigate two model genomes, the human genome and that of a microbe (Burkholderia glumae), which represent a eukaryotic and a prokaryotic system, respectively. The computational complexity of genome-wide mapping using BFAST depends, among other factors, on the size of the reference genome and the data size of the short reads. We analyze the performance characteristics of BFAST and characterize its dependence on different input parameters. This characterization suggests that genome-wide mapping benefits from both scaling up (increased fine-grained parallelism) and scaling out (task-level parallelism, local and distributed). For certain problem instances, scaling out can be more efficient than scaling up. We then design, develop, and demonstrate a runtime environment that supports both the scale-up and scale-out of BFAST on production grid and cloud environments.
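
The scale-out idea mentioned in the abstract (task-level parallelism over the short reads) can be illustrated with a minimal Python sketch. It is not the paper's runtime environment and does not call BFAST: `toy_map` is a hypothetical stand-in that only finds exact-match offsets, `split_reads` and `scale_out` are illustrative names, and a local thread pool stands in for independent tasks that a real deployment would presumably run as separate jobs on grid or cloud resources.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def split_reads(reads: Sequence[str], n_tasks: int) -> List[List[str]]:
    """Partition reads into at most n_tasks contiguous, near-equal chunks."""
    n_tasks = max(1, min(n_tasks, len(reads)))
    size, extra = divmod(len(reads), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        # The first `extra` chunks each take one additional read.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(reads[start:end]))
        start = end
    return chunks


def toy_map(reference: str, reads: Sequence[str]) -> List[int]:
    """Hypothetical mapper (assumption, not BFAST): exact-match offset or -1."""
    return [reference.find(read) for read in reads]


def scale_out(
    reference: str,
    reads: Sequence[str],
    n_tasks: int,
    mapper: Callable[[str, Sequence[str]], List[int]] = toy_map,
) -> List[int]:
    """Map each chunk of reads as an independent task and merge in order."""
    chunks = split_reads(reads, n_tasks)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map preserves chunk order, so merged hits line up with reads.
        parts = list(pool.map(lambda chunk: mapper(reference, chunk), chunks))
    return [hit for part in parts for hit in part]


if __name__ == "__main__":
    reference = "ACGTACGGTTAC"
    reads = ["ACGT", "GGTT", "TTAC", "CCCC", "TACG"]
    print(scale_out(reference, reads, n_tasks=2))
```

Because every chunk is mapped against the same reference independently, the number of tasks can be varied without changing the merged result. Scaling up, by contrast, would mean giving a single mapping task more threads, which this sketch does not model.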