Finding repetitive structures in genomes and proteins is important for understanding their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Even with the best engineering efforts, these methods use 19 to 50 times the text size in space, which makes them impractical for massive data such as the whole human genome. We focus on finding all the maximal repeats in massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. This space efficiency does not come at the cost of speed: our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB of internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.
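To make the notion of a maximal repeat concrete, the following is a minimal brute-force sketch. It is not the Burrows-Wheeler Transform and wavelet tree method described above. It assumes the common definition in which a substring is a maximal repeat if it occurs at least twice and its occurrences are preceded by at least two distinct characters and followed by at least two distinct characters, with text boundaries counted as unique characters. The function name `maximal_repeats` is hypothetical, and the quadratic enumeration of substrings is only suitable for short strings.

```python
from collections import defaultdict


def maximal_repeats(text):
    """Return {repeat: start positions} for every maximal repeat in text.

    Illustrative brute force only: enumerates all substrings, so it needs
    quadratic space and is meant for short inputs.
    """
    n = len(text)

    # Record the start position of every occurrence of every substring.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[text[i:j]].append(i)

    result = {}
    for w, starts in occurrences.items():
        if len(starts) < 2:
            continue  # not a repeat at all
        # Characters immediately left and right of each occurrence.
        # A text boundary is modelled as a marker unique to that occurrence.
        left = {text[s - 1] if s > 0 else ("^", s) for s in starts}
        right = {
            text[s + len(w)] if s + len(w) < n else ("$", s) for s in starts
        }
        # Maximal: the repeat cannot be extended in either direction
        # without losing at least one occurrence.
        if len(left) > 1 and len(right) > 1:
            result[w] = starts
    return result


if __name__ == "__main__":
    print(maximal_repeats("banana"))  # {'a': [1, 3, 5], 'ana': [1, 3]}
```

In "banana", the substring "an" is a repeat but not a maximal one, because both of its occurrences are followed by "a" and so extend to "ana". Reporting only "a" and "ana" still accounts for every repeat in the text, which is the sense in which maximal repeats are a compact summary.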