Masking patterns in sequences: A new class of motif discovery with don't cares

  • Authors:
  • Giovanni Battaglia;Davide Cangelosi;Roberto Grossi;Nadia Pisanti

  • Affiliations:
  • Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy;Dipartimento di Informatica, Università di Pisa, Largo B. Pontecorvo 3, 56127 Pisa, Italy

  • Venue:
  • Theoretical Computer Science
  • Year:
  • 2009

Quantified Score

Hi-index 5.23

Visualization

Abstract

We introduce a new notion of motifs, called masks, that succinctly represents the repeated patterns for an input sequence T of n symbols drawn from an alphabet @S. We show how to build the set of all frequent maximal masks of length L in O(2^Ln) time and space in the worst case, using the Karp-Miller-Rosenberg approach. We analytically show that our algorithm performs better than the method based on constant-time enumerating and checking all the potential (|@S|+1)^L candidate patterns in T, after a polynomial-time preprocessing of T. Our algorithm is also cache-friendly, attaining O(2^Lsort(n)) block transfers, where sort(n) is the cache complexity of sorting n items.