Given an input sequence of data, a motif is a repeating pattern, possibly interspersed with "don't care" characters; a flexible motif may have a variable (as opposed to fixed) number of "don't care" characters. Given a sequence of records with F fields each, an association rule is a common set of f fields, f ≤ F, with identical (or similar) repeating values. The data in either case could be a sequence of characters, sets of characters, or even real values. It is well known that the number of motifs or association rules, say N, can be exponential in the size of the input sequence or the number of records, say n. In this paper we present a new algorithm to discover all flexible motifs or association rules in the input. A novel feature of this algorithm is that its running time is linear in the size of the output (ignoring polylog factors). More precisely, the complexity of the algorithm is O((n^5 + N) log n). This is the first algorithm for motif discovery with a proven output-sensitive complexity bound. The discovery algorithm works in two phases: in the first phase it detects a linear number of core motifs in time polynomial in the input size n, and in the second phase it detects all the remaining motifs, N' in number, in O(N' log n) time. The core motifs of the first phase are also characterized as being those of "highest specificity": loosely speaking, a pattern with higher specificity has fewer "don't care" characters. Some applications (for instance, those that study the portions of the input sequence that contribute to the non-gapped regions of motifs) require only the core motifs. Hence, for such applications, the first phase of the algorithm suffices. The general problem, however, is of use in motif discovery tasks in gene or protein sequences, in the discovery of association rules from gene expression data, and in data mining.
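To make the terminology concrete, the following is a minimal Python sketch of what it means for a motif with "don't care" characters to occur in a sequence. It is not the discovery algorithm of the paper: it only checks occurrences of a motif that is already given, by brute force. The wildcard symbol '.', the encoding of a flexible gap as a (min, max) tuple, and the function names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple, Union

DONT_CARE = "."  # assumed wildcard symbol; the abstract does not fix a notation


def rigid_occurrences(seq: str, motif: str) -> List[int]:
    """Start positions where a fixed-gap motif matches; '.' matches any one character."""
    m = len(motif)
    return [
        i
        for i in range(len(seq) - m + 1)
        if all(p == DONT_CARE or p == c for p, c in zip(motif, seq[i:i + m]))
    ]


# Assumed encoding of a flexible motif: solid characters interleaved with
# (min_gap, max_gap) tuples giving the allowed number of don't-care positions.
FlexMotif = Sequence[Union[str, Tuple[int, int]]]


def flexible_occurrences(seq: str, motif: FlexMotif) -> List[int]:
    """Start positions where a variable-gap motif matches (brute-force check)."""

    def match_from(pos: int, k: int) -> bool:
        if k == len(motif):
            return True
        elem = motif[k]
        if isinstance(elem, tuple):
            lo, hi = elem
            # Try every allowed gap length.
            return any(match_from(pos + g, k + 1) for g in range(lo, hi + 1))
        return pos < len(seq) and seq[pos] == elem and match_from(pos + 1, k + 1)

    return [i for i in range(len(seq)) if match_from(i, 0)]


def dont_care_count(motif: str) -> int:
    """Number of don't-care positions; fewer means higher specificity."""
    return motif.count(DONT_CARE)


rigid_hits = rigid_occurrences("abcaxcac", "a.c")
flex_hits = flexible_occurrences("axbayyb", ["a", (1, 2), "b"])
```

Here "a.c" occurs wherever an 'a' is followed two positions later by a 'c', while the flexible motif allows one or two arbitrary characters between 'a' and 'b'. Under the loose definition in the abstract, "abc" would have higher specificity than "a.c" because it contains fewer "don't care" positions.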