Lightweight LCP construction for next-generation sequencing datasets

Authors:
Markus J. Bauer;Anthony J. Cox;Giovanna Rosone;Marinella Sciortino
Affiliations:
Illumina Cambridge Ltd., United Kingdom;Illumina Cambridge Ltd., United Kingdom;Dipartimento di Matematica e Informatica, University of Palermo, Italy;Dipartimento di Matematica e Informatica, University of Palermo, Italy
Venue:
WABI'12 Proceedings of the 12th international conference on Algorithms in Bioinformatics
Year:
2012

Abstract

The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix (LCP) array of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings and shortest absent words. CPU-efficient algorithms for computing the LCP array of a string have been described in the literature, but they require large data structures to be held in RAM, which makes them infeasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP array and the Burrows-Wheeler transform (BWT) of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole-genome sequencing experiments.
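
For orientation, the two arrays named in the abstract can be illustrated on a toy collection with a naive in-memory sketch. This is not the lightweight, scan-based method the paper proposes: it materializes and sorts every suffix, so it is only usable on tiny inputs. The function names `bwt_and_lcp` and `common_prefix_len`, the use of `$` as the end marker, and the convention that end markers of different sequences are distinct and ordered by sequence index are assumptions made for this illustration.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def bwt_and_lcp(reads):
    """Naive BWT and LCP array of a collection of strings.

    Assumption: each read is implicitly terminated by its own end
    marker, printed as '$'; the markers sort before every letter and
    the marker of read i sorts before that of read j when i < j.
    """
    # One entry per suffix: (suffix text without marker, read index, offset).
    suffixes = [
        (read[j:], i, j)
        for i, read in enumerate(reads)
        for j in range(len(read) + 1)
    ]
    # A proper prefix sorts first in Python, which matches the end marker
    # being the smallest symbol; ties are broken by read index.
    suffixes.sort(key=lambda s: (s[0], s[1]))

    bwt = []
    lcp = []
    prev_text = None
    for text, i, j in suffixes:
        # BWT symbol: the character preceding the suffix in its read,
        # or the end marker when the suffix is the whole read.
        bwt.append(reads[i][j - 1] if j > 0 else "$")
        # LCP value: shared prefix length with the previous suffix in
        # sorted order (0 for the first). Distinct end markers never match.
        lcp.append(0 if prev_text is None else common_prefix_len(prev_text, text))
        prev_text = text
    return "".join(bwt), lcp


if __name__ == "__main__":
    bwt, lcp = bwt_and_lcp(["ACG", "ACT"])
    print(bwt)  # GT$$AACC
    print(lcp)  # [0, 0, 0, 2, 0, 1, 0, 0]
```

In the toy output, the LCP value 2 records that the suffixes `ACG` and `ACT` share the prefix `AC`. Long runs of large LCP values are what make tasks such as finding maximal exact matches fast once the array is available.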