Accuracy is one of the most important issues in identifying biological sequences. However, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time span usually conflicts with accuracy. We propose a novel approach for identifying DNA sequences accurately and in less time by discovering signatures: sequence patterns that carry sufficiently distinctive information to establish the identity of a sequence. The approach discovers the signatures from the best combination of n-gram patterns and statistics-based models, which are regularly used in Information Retrieval research, and then uses the signatures to create identifiers. We evaluate the performance of all identifiers on three different types of organisms and three different numbers of identification classes. The experimental results show that the type of organism has no effect on the performance of the proposed model, whereas the number of classes affects it slightly. When Information Gain is used alone, its performance changes within a small range of n-grams, because its use of pattern absence leads to unbalanced class and pattern score distributions. Nevertheless, several identifiers achieve accuracy above 95%, and up to 100%, when they are constructed from signatures using appropriate n-grams and statistics-based models. The proposed model identifies DNA sequences accurately and requires less processing time.
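As a rough illustration of the idea of scoring n-gram patterns with a statistics-based model, the following minimal sketch ranks DNA n-grams by Information Gain computed from pattern presence and absence. It is not the authors' implementation: the function names (`ngrams`, `entropy`, `information_gain`, `top_signatures`), the presence/absence formulation, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def ngrams(seq, n):
    """Return the set of distinct overlapping n-grams in a DNA sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(pattern, sequences, labels, n):
    """Information Gain of the presence/absence of `pattern` with respect to the labels."""
    present = [lab for seq, lab in zip(sequences, labels) if pattern in ngrams(seq, n)]
    absent = [lab for seq, lab in zip(sequences, labels) if pattern not in ngrams(seq, n)]
    total = len(labels)
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


def top_signatures(sequences, labels, n, k):
    """Return the k highest-scoring n-grams (ties broken alphabetically)."""
    candidates = set().union(*(ngrams(s, n) for s in sequences))
    ranked = sorted(
        candidates,
        key=lambda p: (-information_gain(p, sequences, labels, n), p),
    )
    return ranked[:k]


# Hypothetical toy data: two classes, two sequences each.
toy_sequences = ["ACGTAC", "ACGTTT", "GGCCAA", "GGCCTT"]
toy_labels = ["A", "A", "B", "B"]

if __name__ == "__main__":
    print(top_signatures(toy_sequences, toy_labels, n=3, k=4))
```

In this toy setting, the 3-grams shared by every sequence of one class and absent from the other receive the maximum score, which is the sense in which a pattern can act as a signature. A real identifier would also need a rule for combining signature scores into a class decision, which this sketch leaves out.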