Intelligent data recognition of DNA sequences using statistical models

Authors:
Jitimon Keinduangjun;Punpiti Piamsa-nga;Yong Poovorawan
Affiliations:
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Venue:
PReMI'05 Proceedings of the First international conference on Pattern Recognition and Machine Intelligence
Year:
2005

Abstract

Intelligent data acquisition from biological sequences is a hard and challenging problem, since most biological sequences contain large volumes of diverse and poorly understood data. Intelligent data acquisition nevertheless reduces the need for computationally expensive methods, because the acquired data are more compact and more precise. We propose a novel approach for discovering sequence signatures, which are pieces of information distinctive enough to identify the sequences. The signatures are derived from the best combination of n-grams and statistical scoring models. In our experiments on identifying the Influenza virus, identifiers built from n-gram signatures that are too short, or from inappropriate scoring models, perform poorly, because such combinations produce unbalanced class and pattern score distributions. Identifiers built from an appropriate combination, however, achieve accuracy above 80% and up to 100%. In addition to succeeding at signature recognition, the proposed approach requires little computation time for biological sequence identification.
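
The idea of pairing n-grams with a scoring model can be illustrated with a minimal Python sketch. The abstract does not specify the paper's scoring models, so this is not the authors' method. The score used here (the difference in document frequency between target and non-target sequences), the function names, the threshold, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(seq, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def signature_scores(target_seqs, other_seqs, n):
    """Score each n-gram seen in the target class.

    Assumed scoring model (not taken from the paper): the fraction of
    target sequences containing the n-gram minus the fraction of
    non-target sequences containing it.
    """
    def doc_freq(seqs):
        counts = Counter()
        for s in seqs:
            counts.update(set(ngrams(s, n)))  # count each n-gram once per sequence
        return counts

    target_df = doc_freq(target_seqs)
    other_df = doc_freq(other_seqs)
    return {
        g: target_df[g] / len(target_seqs) - other_df[g] / len(other_seqs)
        for g in target_df
    }


def select_signatures(scores, k):
    """Keep the k highest-scoring n-grams (ties broken alphabetically)."""
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def identify(seq, signatures, n, threshold=0.5):
    """Label a sequence as target if enough signatures occur in it."""
    hits = set(ngrams(seq, n)) & set(signatures)
    return len(hits) / len(signatures) >= threshold


# Toy, made-up sequences; real input would be viral genome sequences.
target = ["ACGTAC", "ACGTTT"]
other = ["TTTTGG", "GGGTTT"]
signatures = select_signatures(signature_scores(target, other, n=3), k=2)
```

With these toy inputs, a query such as `identify("GGACGTGG", signatures, n=3)` checks how many of the selected signatures occur in the query. Varying `n` and the scoring function is one way to explore the abstract's observation that short n-grams and unsuitable scoring models give poorly separated scores.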