Succinct data structures for assembling large genomes

Authors:
Thomas C. Conway;Andrew J. Bromage
Venue:
Bioinformatics
Year:
2011

Citing 0
Cited 4

Parallel and memory-efficient reads indexing for genome assembly
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part II

Succinct de Bruijn graphs
WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics

Space-efficient and exact de Bruijn graph representation based on a Bloom filter
WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics

Memory efficient minimum substring partitioning
Proceedings of the VLDB Endowment

Quantified Score

Hi-index	3.84

Abstract

Motivation: Second-generation sequencing technology makes it feasible for many researchers to obtain enough sequence reads to attempt the de novo assembly of higher eukaryotes (including mammals). De novo assembly not only provides a tool for understanding wide-scale biological variation, but within human biomedicine it also offers a direct way of observing both large-scale structural variation and fine-scale sequence variation. Unfortunately, improvements in the computational feasibility of de novo assembly have not matched the improvements in the gathering of sequence data. This is for two reasons: the inherent computational complexity of the problem and the in-practice memory requirements of tools.

Results: In this article, we use entropy-compressed, or succinct, data structures to create a practical representation of the de Bruijn assembly graph, which requires at least a factor of 10 less storage than the kinds of structures used by deployed methods. Moreover, because our representation is entropy compressed, in the presence of sequencing errors it has better asymptotic scaling behaviour than conventional approaches. We present results of a proof-of-concept assembly of a human genome performed on a modest commodity server.

Availability: Binaries of programs for constructing and traversing the de Bruijn assembly graph are available from http://www.genomics.csse.unimelb.edu.au/succinctAssembly.

Contact: tom.conway@nicta.com.au

Supplementary information: Supplementary data are available at Bioinformatics online.
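The idea behind the Results can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each edge of the de Bruijn graph is a (k+1)-mer packed into an integer at 2 bits per base, and it uses a sorted Python list with binary search as a stand-in for a compressed sparse bitmap with rank/select support. All function names (`encode`, `build_edges`, `successors`, `min_bits`) are hypothetical and chosen for illustration only. The last function computes the information-theoretic minimum number of bits needed to store a set of m edges drawn from a universe of 4^(k+1) possible edges, which is the kind of bound an entropy-compressed structure aims to approach.

```python
import bisect
import math

# 2-bit encoding of nucleotides (an assumption of this sketch).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def encode(kmer):
    """Pack a string over ACGT into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def decode(value, length):
    """Unpack an integer back into a string of the given length."""
    bases = []
    for _ in range(length):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))


def build_edges(reads, k):
    """Return the distinct (k+1)-mers of the reads as a sorted list of integers.

    Each (k+1)-mer is an edge from its k-length prefix to its k-length suffix.
    The sorted list stands in for a sparse bitmap over 4**(k+1) positions.
    """
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add(encode(read[i:i + k + 1]))
    return sorted(edges)


def successors(edges, node, k):
    """List the k-mers reachable from `node` by one edge.

    All edges leaving `node` share its prefix, so they occupy the contiguous
    integer range [prefix, prefix | 3]; two binary searches locate that range.
    """
    prefix = encode(node) << 2
    lo = bisect.bisect_left(edges, prefix)
    hi = bisect.bisect_right(edges, prefix | 3)
    return [decode(e, k + 1)[1:] for e in edges[lo:hi]]


def min_bits(universe, m):
    """Information-theoretic minimum bits to store an m-subset of `universe` items."""
    return math.ceil(math.log2(math.comb(universe, m)))


if __name__ == "__main__":
    k = 3
    reads = ["ACGTAC", "ACGTTC"]  # toy reads, not real data
    edges = build_edges(reads, k)
    print(successors(edges, "CGT", k))
    print(min_bits(4 ** (k + 1), len(edges)), "bits vs", len(edges) * 2 * (k + 1), "bits explicit")
```

For the two toy reads, the node CGT has the successors GTA and GTT. Storing each edge explicitly costs 2(k+1) bits, whereas the subset bound grows more slowly as the edge set fills a larger share of the universe; a production structure would replace the sorted list with a compressed bitmap that supports rank and select.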