An Output-Sensitive Flexible Pattern Discovery Algorithm

  • Authors:
  • Laxmi Parida; Isidore Rigoutsos; Dan Platt

  • Venue:
  • CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
  • Year:
  • 2001

Abstract

Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with "don't care" characters; a flexible motif may have a variable (as opposed to fixed) number of "don't care" characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, can be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs N' in O(N' log n) time. The core motifs of the first phase are also characterized as being those of "highest specificity": loosely speaking, a pattern with higher specificity has fewer "don't care" characters. Some applications (for instance, those that study the portions of the input sequence contributing to the non-gapped regions of motifs) require only the core motifs, so for such applications the first phase of the algorithm suffices. The general problem, however, is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
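
To make the definitions concrete, the following minimal sketch shows what it means for a rigid or flexible motif to occur in a sequence. It is only an illustration of the terminology, not the discovery algorithm of the paper: it checks a given motif by brute force through Python's standard `re` module and makes no attempt at the output-sensitive bound. The motif encoding (a list of solid characters and `(lo, hi)` gap ranges) and the function names `motif_to_regex` and `occurrences` are assumptions introduced here for illustration.

```python
import re

def motif_to_regex(motif):
    """Translate a motif into a regular expression string.

    A motif is a list whose elements are either a solid character (str)
    or a gap given as a (lo, hi) tuple: between lo and hi "don't care"
    characters. This encoding is a hypothetical one chosen for this sketch.
    """
    parts = []
    for element in motif:
        if isinstance(element, str):
            parts.append(re.escape(element))          # solid character
        else:
            lo, hi = element
            parts.append(".{%d,%d}" % (lo, hi))       # run of don't cares
    return "".join(parts)

def occurrences(sequence, motif):
    """Return every start position at which the motif matches."""
    # The lookahead makes each match zero-width, so overlapping
    # occurrences are all reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")", re.DOTALL)
    return [m.start() for m in pattern.finditer(sequence)]

sequence = "abcabxcabyyc"
rigid = ["a", "b", (1, 1), "c"]      # exactly one don't care: "ab.c"
flexible = ["a", "b", (0, 2), "c"]   # zero to two don't cares

print(occurrences(sequence, rigid))     # [3]
print(occurrences(sequence, flexible))  # [0, 3, 7]
```

In this example the rigid motif, with exactly one "don't care" character, occurs once, while the flexible version with a gap of zero to two characters occurs three times. This is why allowing variable gaps enlarges the set of repeating patterns that can be reported.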