Efficient Maximal Repeat Finding Using the Burrows-Wheeler Transform and Wavelet Tree

Authors:
M. Oguzhan Kulekci;Jeffrey Scott Vitter;Bojian Xu
Affiliations:
National Research Institute of Electronics and Cryptology, Gebze;The University of Kansas, Lawrence;Eastern Washington University, Cheney
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2012

Citing 24
Cited 3

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Reducing the space requirement of suffix trees

Software—Practice & Experience
Succinct representations of lcp information and improvements in the compressed suffix arrays

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
High-order entropy-compressed text indexes

SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications

CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Compact suffix array: a space-efficient full-text index

Fundamenta Informaticae - Special issue on computing patterns in strings
Indexing compressed text

Journal of the ACM (JACM)
Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

SIAM Journal on Computing
Compressed full-text indexes

ACM Computing Surveys (CSUR)
A Space and Time Efficient Algorithm for Constructing Compressed Suffix Arrays

Algorithmica
Alphabet-independent linear-time construction of compressed suffix arrays using o(nlogn)-bit working space

Theoretical Computer Science
Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets

ACM Transactions on Algorithms (TALG)
Fast BWT in small space by blockwise suffix sorting

Theoretical Computer Science
Algorithms and data structures for external memory

Foundations and Trends® in Theoretical Computer Science
Better external memory suffix array construction

Journal of Experimental Algorithmics (JEA)
Space Efficient String Mining under Frequency Constraints

ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Permuted Longest-Common-Prefix Array

CPM '09 Proceedings of the 20th Annual Symposium on Combinatorial Pattern Matching
Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome

Bioinformatics
Engineering a compressed suffix tree implementation

Journal of Experimental Algorithmics (JEA)
A Linear-Time Burrows-Wheeler Transform Using Induced Sorting

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Breaking a Time-and-Space Barrier in Constructing Full-Text Indices

SIAM Journal on Computing
Sampled longest common prefix array

CPM'10 Proceedings of the 21st annual conference on Combinatorial pattern matching
Computing matching statistics and maximal exact matches on compressed full-text indexes

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
DNA compression challenge revisited: a dynamic programming approach

CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching

On enumerating the DNA sequences

Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine
Space-Efficient computation of maximal and supermaximal repeats in genome sequences

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
RACE: a scalable and elastic parallel system for discovering repeats in very long sequences

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Finding repetitive structures in genomes and proteins is important to understand their biological functions. Many data compressors for modern genomic sequences rely heavily on finding repeats in the sequences. Small-scale and local repetitive structures are better understood than large and complex interspersed ones. The notion of maximal repeats captures all the repeats in the data in a space-efficient way. Prior work on maximal repeat finding used either a suffix tree or a suffix array along with other auxiliary data structures. Their space usage is 19-50 times the text size with the best engineering efforts, prohibiting their usability on massive data such as the whole human genome. We focus on finding all the maximal repeats from massive texts in a time- and space-efficient manner. Our technique uses the Burrows-Wheeler Transform and wavelet trees. For data sets consisting of natural language texts and protein data, the space usage of our method is no more than three times the text size. For genomic sequences stored using one byte per base, the space usage of our method is less than double the sequence size. Our space-efficient method keeps the timing performance fast. In fact, our method is orders of magnitude faster than the prior methods for processing massive texts such as the whole human genome, since the prior methods must use external memory. For the first time, our method enables a desktop computer with 8 GB internal memory (actual internal memory usage is less than 6 GB) to find all the maximal repeats in the whole human genome in less than 17 hours. We have implemented our method as general-purpose open-source software for public use.