Finding similar regions in many sequences
Journal of Computer and System Sciences - STOC 1999
Combinatorial Approaches to Finding Subtle Signals in DNA Sequences
Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Spelling Approximate Repeated or Common Motifs Using a Suffix Tree
LATIN '98 Proceedings of the Third Latin American Symposium on Theoretical Informatics
An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Tries in data retrieval and syntactic pattern recognition
Tries in data retrieval and syntactic pattern recognition
Linear pattern matching algorithms
SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
A frequent pattern mining method for finding planted (l, d)-motifs of unknown length
RSKT'10 Proceedings of the 5th international conference on Rough set and knowledge technology
Component-based matching for multiple interacting RNA sequences
ISBRA'11 Proceedings of the 7th international conference on Bioinformatics research and applications
Informative motifs in protein family alignments
WABI'07 Proceedings of the 7th international conference on Algorithms in Bioinformatics
Hi-index | 0.00 |
One of the hardest and long-standing problems in Bioinformatics is the problem of motif discovery in biological sequences. It is the problem of finding recurring patterns in these sequences. Apriori is a well-known data mining algorithm. It is used to mine frequent patterns in large datasets. In this paper, we would like to apply Apriori to the common motifs discovery problem. We propose three modifications so that we can adapt the classic Apriori to our problem. First, the Trie data structure is used to store all biological sequences under examination. Second, both of the frequent pattern extraction and the candidate generation steps are done using the same data structure, the Trie . The Trie allows to simultaneously search all possible starting points in the sequence for any occurrence of the given pattern. Third, instead of using only the support as a measure to assess frequent patterns, a new measure, the normalized information content (normIC), is proposed which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa's benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for average sensitivity.