Discovering subword associations in strings in time linear in the output size

  • Authors:
  • Alberto Apostolico;Giorgio Satta

  • Affiliations:
  • College of Computing, Georgia Institute of Technology, 801 Atlantic Dr., Atlanta 30332 GA, USA and Department of Information Engineering, University of Padua, via Gradenigo, 6/A, I-35131 Padova, I ...;Department of Information Engineering, University of Padua, via Gradenigo, 6/A, I-35131 Padova, Italy

  • Venue:
  • Journal of Discrete Algorithms
  • Year:
  • 2009

Abstract

Given a text string x of n symbols and an integer constant d, we consider the problem of finding, for any pair (y,z) of subwords of x, the tandem index associated with the pair, which is defined as the number of times that y and z occur in tandem (i.e., with no intermediate occurrence of either one of them) within a distance of d symbols of x. Although in principle there might be O(n^4) distinct subword pairs in x, it is seen that it suffices to consider a family of only O(n^2) such pairs, with the property that for any neglected pair (y',z') there exists a corresponding pair (y,z) contained in our family such that: (i) y' is a prefix of y and z' is a prefix of z; and (ii) the tandem index of (y',z') equals that of (y,z). The main contribution of the paper consists of an algorithm showing that the computation of all non-zero tandem indices for a string can be carried out optimally in time and space linear in the size of the output.
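
To make the definition of the tandem index concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper, which runs in time linear in the output size; it only illustrates one plausible reading of the definition for a single pair (y,z). The function names `occurrences` and `tandem_index` are hypothetical, and two points the abstract does not spell out are treated here as assumptions: the distance is measured between the start positions of y and z, and an "intermediate occurrence" is an occurrence of y or z that starts strictly between those two positions.

```python
def occurrences(x, w):
    """Start positions of all (possibly overlapping) occurrences of w in x."""
    return [i for i in range(len(x) - len(w) + 1) if x.startswith(w, i)]


def tandem_index(x, y, z, d):
    """Brute-force tandem index of the pair (y, z) in x.

    Assumptions (not stated in the abstract):
      * distance is measured between start positions, so an occurrence of z
        at j follows an occurrence of y at i within distance d if i < j <= i + d;
      * an intermediate occurrence is any occurrence of y or z that starts
        strictly between i and j.
    """
    ys = occurrences(x, y)
    zs = occurrences(x, z)
    starts = set(ys) | set(zs)
    count = 0
    for i in ys:
        for j in zs:
            within_distance = i < j <= i + d
            # Reject the pair if some occurrence of y or z starts in between.
            if within_distance and not any(i < k < j for k in starts):
                count += 1
    return count


if __name__ == "__main__":
    print(tandem_index("abab", "a", "b", 1))      # prints 2
    print(tandem_index("abcabdab", "ab", "c", 3)) # prints 1
```

Under these assumptions, enlarging d in the first example does not change the result: the occurrence of b at position 3 is not counted against the occurrence of a at position 0, because other occurrences of a and b start in between. Applying this check to every pair of subwords would take time polynomial in n of high degree, which is the cost that the restriction to O(n^2) representative pairs and the output-linear algorithm described in the abstract are meant to avoid.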