A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the Burrows-Wheeler transform (BWT) in compression and indexing are well known, but the computational demands of building the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. Both are lightweight: the first needs O(m log m) bits of memory to process m strings, and the memory requirement of the second is constant with respect to m. We evaluate the algorithms on collections of up to 1 billion strings and compare their performance with other approaches on smaller datasets. Although our tests used collections of DNA sequences of uniform length, the algorithms apply to any string collection over any alphabet.
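To make the object being computed concrete, the following is a minimal sketch of the BWT of a string collection, built naively by sorting every suffix of every string. It is not either of the two lightweight algorithms described above: it holds all suffixes in memory at once and is only practical for toy inputs. The function name `naive_collection_bwt` is hypothetical, and the convention used (each string gets its own end marker, markers sort before all other symbols and are ordered by the string's position in the collection) is an assumption chosen for illustration.

```python
def naive_collection_bwt(strings):
    """Return the BWT of a string collection as a single string.

    Assumed convention: string i is terminated by its own end marker.
    Markers sort before every ordinary character and are ordered by i.
    All markers are printed as "$" in the output.
    """
    suffixes = []
    for i, s in enumerate(strings):
        # Encode symbols as tuples so the ordering is explicit:
        # (0, i) is the end marker of string i, (1, c) is character c.
        symbols = [(1, c) for c in s] + [(0, i)]
        for start in range(len(symbols)):
            # Symbol that cyclically precedes this suffix in its own string;
            # for start == 0 this wraps around to the end marker.
            preceding = symbols[start - 1]
            suffixes.append((symbols[start:], preceding))

    # Sort all suffixes of all strings together. The distinct end markers
    # guarantee that no two suffixes compare equal.
    suffixes.sort(key=lambda pair: pair[0])

    # The BWT is the sequence of preceding symbols in sorted-suffix order.
    return "".join("$" if kind == 0 else value
                   for _, (kind, value) in suffixes)


if __name__ == "__main__":
    print(naive_collection_bwt(["ACA", "GA"]))  # prints AACG$A$
```

For the collection `["ACA", "GA"]` the sorted suffixes are `$`, `$`, `A$`, `A$`, `ACA$`, `CA$`, `GA$` (with the first string's marker ranked before the second's), and reading off the preceding symbols gives `AACG$A$`. The sketch stores a full copy of every suffix, so its memory use grows quadratically with string length, which is exactly the kind of cost that lightweight construction is meant to avoid.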