Efficient construction of an assembly string graph using the FM-index

Authors:
Jared T. Simpson;Richard Durbin
Venue:
Bioinformatics
Year:
2010


Abstract

Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms.

Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed length read sets, including capillary reads or long reads promised by the third generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.

Contact: js18@sanger.ac.uk
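
To make the overlap query concrete, the following is a minimal Python sketch of backward search over an FM-index, used to find every read whose prefix matches a suffix of a query read with length at least τ. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_fm_index` and `suffix_prefix_overlaps` and the parameter `min_overlap` (standing in for τ) are hypothetical, the index is built by naively sorting suffixes (far slower than a real construction), and a plain suffix array is kept alongside the index to identify read starts. It reports all overlaps and does not attempt the transitive reduction or the direct output of irreducible overlaps described in the abstract.

```python
def build_fm_index(reads):
    """Build a toy FM-index over reads joined with '$' terminators."""
    text = "".join(r + "$" for r in reads)
    # Naive suffix array: fine for a demo, not for real read sets.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (cyclic at position 0).
    bwt = [text[i - 1] for i in sa]
    alphabet = sorted(set(text))
    # first[c] = number of suffixes starting with a character smaller than c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    # Map the text position where each read starts to its read id.
    starts, pos = {}, 0
    for read_id, r in enumerate(reads):
        starts[pos] = read_id
        pos += len(r) + 1
    return sa, first, occ, starts


def suffix_prefix_overlaps(query_id, reads, index, min_overlap):
    """Return sorted (other_read_id, overlap_length) pairs where a suffix of
    the query read equals a prefix of another read."""
    sa, first, occ, starts = index
    query = reads[query_id]
    lo, hi = 0, len(sa)  # interval for the empty pattern: all suffixes
    hits = []
    # Backward search: extend the matched suffix of the query leftwards.
    for i in range(len(query) - 1, -1, -1):
        c = query[i]
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        length = len(query) - i
        if length < min_overlap:
            continue
        # Rows in [lo, hi) are suffixes beginning with query[i:]; those that
        # sit at a read start are reads having query[i:] as a prefix.
        for row in range(lo, hi):
            other = starts.get(sa[row])
            if other is not None and other != query_id:
                hits.append((other, length))
    return sorted(hits)


reads = ["ACGTAC", "GTACGG", "TACGGA"]
index = build_fm_index(reads)
print(suffix_prefix_overlaps(0, reads, index, 3))  # [(1, 4), (2, 3)]
```

Each backward-search step costs a constant number of table lookups, so the search itself is linear in the length of the query read; the inner scan over the interval is a simplification made here for readability. A full-length match (the query being a prefix of another read) is also reported by this sketch.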