DNA sequence identification by statistics-based models

Authors:
Jitimon Keinduangjun;Punpiti Piamsa-nga;Yong Poovorawan
Affiliations:
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Venue:
FSKD'05 Proceedings of the Second international conference on Fuzzy Systems and Knowledge Discovery - Volume Part II
Year:
2005

Abstract

Accuracy is one of the most important issues in identifying biological sequences; however, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time often conflicts with accuracy. We propose a novel approach for identifying DNA sequences accurately and in less time by discovering sequence patterns, called signatures, that carry sufficiently distinctive information to establish the identity of a sequence. The approach discovers signatures from the best combination of n-gram patterns and statistics-based models commonly used in Information Retrieval research, and then uses the signatures to create identifiers. We evaluate the performance of all identifiers on three different types of organisms and three different numbers of identification classes. The experimental results show that the type of organism has no effect on the performance of the proposed model, whereas the number of classes affects it slightly. Information Gain used on its own is effective only over a small range of n-grams, because its use of pattern absence leads to unbalanced class and pattern score distributions. However, several identifiers achieve over 95% and up to 100% accuracy when they are constructed from signatures using appropriate n-grams and statistics-based models. The proposed model identifies DNA sequences accurately while requiring less processing time.
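The abstract does not give the scoring formulas, so the following is only a minimal sketch of how n-gram signatures might be ranked with Information Gain. It assumes the standard presence/absence definition of Information Gain from text categorization, which may differ from the authors' exact formulation. The function names, the toy sequences, and the class labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return the set of distinct overlapping n-grams in a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(labeled_seqs, n):
    """Score every n-gram by Information Gain over its presence/absence.

    labeled_seqs is a list of (sequence, class_label) pairs.
    """
    classes = sorted({label for _, label in labeled_seqs})
    total = len(labeled_seqs)
    class_counts = Counter(label for _, label in labeled_seqs)
    base = entropy([class_counts[c] for c in classes])

    # n-gram -> per-class count of sequences that contain it
    present = {}
    for seq, label in labeled_seqs:
        for g in ngrams(seq, n):
            present.setdefault(g, Counter())[label] += 1

    scores = {}
    for g, with_g in present.items():
        n_with = sum(with_g.values())
        n_without = total - n_with
        h_with = entropy([with_g[c] for c in classes])
        # Pattern absence also contributes to the score.
        h_without = entropy([class_counts[c] - with_g[c] for c in classes])
        scores[g] = (base
                     - (n_with / total) * h_with
                     - (n_without / total) * h_without)
    return scores


def top_signatures(labeled_seqs, n, k):
    """Return the k highest-scoring n-grams (ties broken alphabetically)."""
    scores = information_gain(labeled_seqs, n)
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


# Hypothetical toy data: two sequences per class.
toy = [
    ("ACGTACGT", "A"),
    ("ACGTTTGA", "A"),
    ("GGCCGGCC", "B"),
    ("GGCCTTAA", "B"),
]
scores = information_gain(toy, 3)
signatures = top_signatures(toy, 3, 4)
```

In this sketch, an n-gram that occurs in every sequence of one class and in none of the other receives the highest score, while an n-gram seen in a single sequence scores lower. An identifier could then be built from the top-ranked n-grams; how the paper combines n-gram length with the other statistics-based models is not specified in the abstract.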