Trie-based apriori motif discovery approach

Authors:
Isra Al-Turaiki; Ghada Badr; Hassan Mathkour
Affiliations:
College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia (all authors)
Venue:
ISBRA'12 Proceedings of the 8th international conference on Bioinformatics Research and Applications
Year:
2012

Abstract

Motif discovery, the problem of finding recurring patterns in biological sequences, is one of the hardest and most long-standing problems in bioinformatics. Apriori is a well-known data mining algorithm for mining frequent patterns in large datasets. In this paper, we apply Apriori to the common motif discovery problem and propose three modifications that adapt the classic algorithm to it. First, a Trie data structure is used to store all biological sequences under examination. Second, both the frequent pattern extraction step and the candidate generation step are carried out on that same Trie, which makes it possible to search all possible starting points in the sequences simultaneously for any occurrence of a given pattern. Third, instead of using support as the only measure for assessing frequent patterns, we propose a new measure, the normalized information content (normIC), which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa's benchmark to investigate the performance of the proposed algorithm, Trie-based Apriori Motif Discovery (TrieAMD). The results show that our algorithm outperforms all of the tested tools on real datasets in terms of average sensitivity.
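To make the first two modifications concrete, the following is a minimal Python sketch of a level-wise, Apriori-style search over a trie. It is an illustration under stated assumptions, not the TrieAMD implementation: it counts exact occurrences only, it measures support as the number of sequences containing a pattern, and it leaves out normIC scoring because the abstract does not give its formula. The names `TrieNode`, `build_trie`, `frequent_patterns`, `min_support`, and `max_len` are hypothetical.

```python
class TrieNode:
    """One node per distinct substring; records which sequences contain it."""

    def __init__(self):
        self.children = {}    # symbol -> TrieNode
        self.seq_ids = set()  # ids of sequences containing this substring


def build_trie(sequences, max_len):
    """Insert the substrings of length <= max_len from every start position.

    Assumption: exact matching only; real motif discovery usually has to
    tolerate mismatches, which this sketch does not handle.
    """
    root = TrieNode()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for symbol in seq[start:start + max_len]:
                node = node.children.setdefault(symbol, TrieNode())
                node.seq_ids.add(seq_id)
    return root


def frequent_patterns(sequences, min_support, max_len):
    """Level-wise walk: a pattern is extended only if it is itself frequent.

    This mirrors Apriori's candidate generation, since any extension of an
    infrequent pattern must also be infrequent.
    """
    root = build_trie(sequences, max_len)
    result = {}
    frontier = [("", root)]  # patterns of the current length
    while frontier:
        next_frontier = []
        for prefix, node in frontier:
            for symbol, child in sorted(node.children.items()):
                support = len(child.seq_ids)
                if support >= min_support:
                    pattern = prefix + symbol
                    result[pattern] = support
                    next_frontier.append((pattern, child))
        frontier = next_frontier
    return result


if __name__ == "__main__":
    toy = ["ACGTAC", "TTACGA", "GACGTT"]  # made-up sequences, not benchmark data
    for pattern, support in frequent_patterns(toy, min_support=3, max_len=3).items():
        print(pattern, support)
```

Because every start position of every sequence is inserted into the same trie, a single descent from the root checks a candidate pattern against all starting points at once, and the children of a frequent node are exactly the candidates for the next level.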