The advent of "next-generation" DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in bioinformatics. Knowing the longest common prefix (LCP) array of such a collection would facilitate the rapid computation of maximal exact matches, shortest unique substrings, and shortest absent words. CPU-efficient algorithms for computing the LCP array of a string have been described in the literature, but they require large data structures to be held in RAM, which makes them infeasible for NGS datasets. In this paper we propose the first lightweight method that simultaneously computes, via sequential scans, the LCP array and the Burrows-Wheeler transform (BWT) of very large collections of sequences. Computational results on collections as large as 800 million 100-mers demonstrate that our algorithm scales to the vast sequence collections encountered in human whole-genome sequencing experiments.
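To make the two output arrays concrete, the following is a minimal in-memory sketch that builds the BWT and LCP array of a small string collection by explicitly sorting all suffixes. It is not the lightweight sequential-scan method proposed in the paper: it holds every suffix in RAM, which is exactly what that method avoids. The function name `bwt_and_lcp`, the encoding of end markers as tuples, and the convention that each string receives its own end marker (ordered by string index and printed as `$`) are assumptions made for illustration.

```python
def bwt_and_lcp(strings):
    """Naive BWT and LCP array of a string collection (illustrative only)."""
    # Assumption: string i gets its own end marker, encoded as (0, i), so
    # markers sort before letters (encoded as (1, ch)) and never match
    # each other.
    encoded = [[(1, c) for c in s] + [(0, i)] for i, s in enumerate(strings)]

    # Materialise every suffix of every string; this is the memory-hungry
    # step that a lightweight method would avoid.
    suffixes = []
    for i, symbols in enumerate(encoded):
        for j in range(len(symbols)):
            suffixes.append((symbols[j:], i, j))
    suffixes.sort(key=lambda item: item[0])

    bwt = []
    lcp = []
    previous = None
    for suffix, i, j in suffixes:
        # BWT symbol: the symbol preceding the suffix in its own string,
        # taken cyclically (for j == 0 this is the string's end marker).
        kind, value = encoded[i][j - 1]
        bwt.append("$" if kind == 0 else value)

        # LCP value: length of the common prefix with the previous suffix
        # in sorted order (0 for the first suffix).
        common = 0
        if previous is not None:
            for a, b in zip(previous, suffix):
                if a != b:
                    break
                common += 1
        lcp.append(common)
        previous = suffix

    return "".join(bwt), lcp


bwt, lcp = bwt_and_lcp(["AC", "AG"])
print(bwt)  # CG$$AA
print(lcp)  # [0, 0, 0, 1, 0, 0]
```

For the collection {AC, AG}, the sorted suffixes are $, $, AC$, AG$, C$, G$ (with the two end markers kept distinct), so the only non-zero LCP value is the 1 shared by AC$ and AG$. The sketch needs memory quadratic in the string length per string, so it is suitable only for toy inputs.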