An Output-Sensitive Flexible Pattern Discovery Algorithm

Authors:
Laxmi Parida;Isidore Rigoutsos;Dan Platt
Affiliations:
-;-;-
Venue:
CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Year:
2001

Citing 6
Cited 6

Combinatorial pattern discovery for scientific data: some preliminary results

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Motif discovery without alignment or enumeration (extended abstract)

RECOMB '98 Proceedings of the second annual international conference on Computational molecular biology
Clustering gene expression patterns

RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and an efficient polynomial time algorithm

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Some Results on Flexible-Pattern Discovery

CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
A Double Combinatorial Approach to Discovering Patterns in Biological Sequences

CPM '96 Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching

Bases of Motifs for Generating Repeated Patterns with Wild Cards

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Maximal and minimal representations of gapped and non-gapped motifs of a string

Theoretical Computer Science
A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence

ISAAC'05 Proceedings of the 16th international conference on Algorithms and Computation
Bridging lossy and lossless compression by motif pattern discovery

General Theory of Information Transfer and Combinatorics
Faster variance computation for patterns with gaps

MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
Boolean satisfiability for sequence mining

Proceedings of the 22nd ACM international conference on information & knowledge management

Abstract

Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with "don't care" characters; a flexible motif may have a variable (as opposed to fixed) number of "don't care" characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, could potentially be exponential in the size of the input sequence or number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs, N' in number, in O(N' log n) time. The core motifs of the first phase are also characterized as being those of "highest specificity": loosely speaking, a pattern with higher specificity has fewer "don't care" characters. Some applications (for instance, those that study the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs; for such applications, the first phase of the algorithm suffices. The general problem, however, is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
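To make the definition of a flexible motif concrete, the following is a minimal Python sketch that only locates the occurrences of one given motif in a character sequence. It is not the discovery algorithm of the paper. The motif encoding (a list of literal characters and `(lo, hi)` gap ranges) and the function names `motif_to_regex` and `occurrences` are assumptions introduced here for illustration.

```python
import re
from typing import List, Tuple, Union

# Hypothetical encoding: a motif element is either a literal character
# or a gap (lo, hi), meaning between lo and hi "don't care" characters.
Element = Union[str, Tuple[int, int]]


def motif_to_regex(motif: List[Element]):
    """Compile a motif into a regular expression."""
    parts = []
    for element in motif:
        if isinstance(element, str):
            parts.append(re.escape(element))
        else:
            lo, hi = element
            parts.append(".{%d,%d}" % (lo, hi))
    # Wrap in a lookahead so that overlapping occurrences are all found.
    return re.compile("(?=(" + "".join(parts) + "))")


def occurrences(motif: List[Element], text: str) -> List[int]:
    """Return every start position at which the motif matches the text."""
    return [m.start() for m in motif_to_regex(motif).finditer(text)]


text = "ABXCDABYYCD"
flexible = ["A", "B", (1, 2), "C", "D"]  # one or two don't-care characters
rigid = ["A", "B", (1, 1), "C", "D"]     # exactly one don't-care character

print(occurrences(flexible, text))  # [0, 5]
print(occurrences(rigid, text))     # [0]
```

In this example the flexible motif repeats in the text while its fixed-gap counterpart occurs only once, which illustrates why allowing a variable number of "don't care" characters enlarges the set of repeating patterns.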