Computing all repeats using suffix arrays

  • Authors:
  • Frantisek Franěk; William F. Smyth; Yudong Tang

  • Affiliations:
  • Algorithms Research Group, Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada; Algorithms Research Group, Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada and School of Computing, Curtin University, Perth, Australia; Algorithms Research Group, Department of Computing and Software, McMaster University, Hamilton, Ontario, Canada

  • Venue:
  • Journal of Automata, Languages and Combinatorics - Special issue: Selected papers of the 13th Australasian workshop on combinatorial algorithms
  • Year:
  • 2003

Abstract

We describe an algorithm that identifies all the repeating substrings (tandem, overlapping, and split) in a given string x = x[1..n]. Given the suffix arrays of x and of the reversal of x, the algorithm runs in Θ(n) time and represents its output in Θ(n) space, either as a reduced suffix array (called an NE array) or as a reduced suffix tree (called an NE tree). The output substrings u are nonextendible (NE); that is, any extension of some occurrence of u in x, either to the left or to the right, yields a string (λu or uλ) that is unequal to the same extension of some other occurrence of u. Thus the number of substrings output is the minimum required to identify all the repeating substrings in x. The output can be used in a straightforward way to identify only repeating substrings that satisfy some proximity or minimum length condition.
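
To make the nonextendible condition concrete, here is a minimal brute-force sketch in Python. It is not the Θ(n) algorithm of the paper and uses no suffix arrays; it enumerates every substring and keeps those that occur at least twice and whose occurrences disagree on both the preceding and the following letter. The function name `ne_repeats`, the 0-based positions, and the treatment of a string boundary as an extension that matches nothing are assumptions made for illustration, based on one reading of the definition above.

```python
from collections import defaultdict
from typing import Dict, List


def ne_repeats(x: str) -> Dict[str, List[int]]:
    """Map each nonextendible repeating substring of x to its start positions.

    Illustrative brute force only (roughly cubic time and quadratic space);
    positions are 0-based.
    """
    n = len(x)

    # Record every occurrence of every nonempty substring.
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[x[i:j]].append(i)

    result = {}
    for u, starts in occurrences.items():
        if len(starts) < 2:
            continue  # u does not repeat
        # Letters immediately left and right of each occurrence;
        # None marks a string boundary (assumed to differ from any letter).
        left = {x[i - 1] if i > 0 else None for i in starts}
        right = {x[i + len(u)] if i + len(u) < n else None for i in starts}
        # u is kept only if neither a left nor a right extension
        # is shared by all of its occurrences.
        if len(left) > 1 and len(right) > 1:
            result[u] = starts
    return result


if __name__ == "__main__":
    for text in ("abab", "aaa", "abcab"):
        print(text, ne_repeats(text))
```

Under these assumptions, "abab" yields only "ab" (a tandem repeat), "aaa" yields "a" and "aa" (overlapping repeats), and "abcab" yields only "ab" (a split repeat). Shorter repeats such as "a" in "abab" are dropped because every occurrence extends to the right by the same letter. A minimum length condition can then be applied by filtering the returned dictionary on `len(u)`.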