Characterization and extraction of irredundant tandem motifs

  • Authors: Laxmi Parida; Cinzia Pizzi; Simona E. Rombo

  • Affiliations: IBM T.J. Watson Research Center; Department of Information Engineering, University of Padova, Italy; ICAR-CNR of Cosenza & DEIS, Università della Calabria, Italy

  • Venue: SPIRE'12: Proceedings of the 19th International Conference on String Processing and Information Retrieval
  • Year: 2012

Abstract

We address the problem of extracting pairs of subwords (m1, m2) from a text string s of length n such that, given an integer constant d as additional input, m1 and m2 occur in tandem within a maximum distance of d symbols in s. The main effort of this work is to eliminate redundancy from the resulting candidate set of tandem motifs. To this end, we first introduce a notion of maximality, characterized by four specific conditions, which we show cannot be deduced from the corresponding notion of maximality already defined for "simple" (i.e., non-tandem) motifs. We then eliminate the remaining redundancy by defining the concept of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandems is O(d^2 n), which, when d is treated as a constant, is linear in the length of the input string. This is an order of magnitude less than previously developed compact indexes for tandem extraction. As a further contribution, we present an algorithm to extract this compact irredundant index.
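
To make the extraction problem concrete, the following is a minimal brute-force sketch of candidate generation only. It is not the algorithm from the paper and does not implement the maximality or irredundancy filters. It assumes one possible reading of "in tandem within a maximum distance of d symbols": m2 starts g symbols after m1 ends, with 0 <= g <= d. The function name `tandem_candidates`, the fixed lengths `len1` and `len2`, and the quorum `min_occ` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def tandem_candidates(s, d, len1, len2, min_occ=2):
    """Naively enumerate candidate tandem pairs (m1, m2) in s.

    Assumption (not from the paper): m2 begins g symbols after the
    end of m1, where 0 <= g <= d, so the two subwords never overlap.

    Returns a dict mapping (m1, m2) to a list of (start_of_m1, gap)
    tuples, keeping only pairs whose m1 starts at no fewer than
    min_occ distinct positions.
    """
    occ = defaultdict(list)
    n = len(s)
    for i in range(n - len1 + 1):
        m1 = s[i:i + len1]
        end1 = i + len1
        for g in range(d + 1):
            j = end1 + g              # start of the candidate m2
            if j + len2 > n:          # m2 would run past the end of s
                break
            occ[(m1, s[j:j + len2])].append((i, g))
    # Keep pairs that recur at enough distinct m1 positions.
    return {pair: pos for pair, pos in occ.items()
            if len({i for i, _ in pos}) >= min_occ}


result = tandem_candidates("abxxcdabycd", d=2, len1=2, len2=2)
print(result)
```

With fixed subword lengths, this enumeration inspects at most (d + 1) gap values per start position of m1. Letting the lengths vary multiplies the number of candidates, which is the kind of redundancy that the maximality and irredundancy notions are meant to prune.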