Characterizing deep sequencing analytics using BFAST: towards a scalable distributed architecture for next-generation sequencing data

Authors:
Joohyun Kim;Sharath Maddineni;Shantenu Jha
Affiliations:
Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA;Louisiana State University, Baton Rouge, LA, USA
Venue:
Proceedings of the second international workshop on Emerging computational methods for the life sciences
Year:
2011


Abstract

Next-generation DNA sequencing (NGS) platforms produce significantly larger amounts of data than earlier Sanger sequencers. In addition to the data-management challenges that arise from unprecedented data volumes, the data must also be analyzed effectively. In this paper, we use BFAST, a genome-wide mapping application, as a representative example of the typical analysis required on data from NGS machines. We investigate two model genomes, the human genome and a microbe (Burkholderia glumae), which represent a eukaryotic and a prokaryotic system, respectively. The computational complexity of genome-wide mapping using BFAST depends, among other factors, on the size of the reference genome and the data size of the short reads. We analyze the performance characteristics of BFAST and characterize its dependency on different input parameters. This characterization suggests that genome-wide mapping benefits from both scaling up (increased fine-grained parallelism) and scaling out (task-level parallelism, local and distributed). For certain problem instances, scaling out can be more efficient than scaling up. We then design, develop, and demonstrate a runtime environment that supports both the scale-up and scale-out of BFAST on production grid and cloud environments.
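
The scale-out idea in the abstract (task-level parallelism over independent pieces of the short-read data) can be illustrated with a minimal Python sketch. It is not the runtime environment described in the paper and does not call BFAST: `map_chunk` is a hypothetical stand-in that does exact substring matching, the function names and chunk size are assumptions made for illustration, and a local thread pool stands in for jobs that would run on separate cores or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_reads(reads, chunk_size):
    """Split the read list into consecutive chunks of at most chunk_size reads."""
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def map_chunk(reference, chunk):
    """Toy stand-in for an aligner (assumption, not BFAST):
    report the first exact-match position of each read, or -1 if absent."""
    return [(read, reference.find(read)) for read in chunk]


def scale_out_map(reference, reads, chunk_size, workers):
    """Map each chunk as an independent task and merge results in input order."""
    chunks = chunk_reads(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the submitted chunks.
        per_chunk = list(pool.map(lambda c: map_chunk(reference, c), chunks))
    return [hit for chunk_hits in per_chunk for hit in chunk_hits]


reference = "ACGTACGTTTGACCA"
reads = ["ACGT", "TTGA", "GGGG", "CCA"]
hits = scale_out_map(reference, reads, chunk_size=2, workers=2)
for read, position in hits:
    print(read, position)
```

Because each chunk depends only on the reference and its own reads, the tasks need no communication with one another, which is the property that lets this kind of workload be spread across local or distributed resources.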