Signature recognition methods for identifying influenza sequences

Authors:
Jitimon Keinduangjun; Punpiti Piamsa-nga; Yong Poovorawan
Affiliations:
Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand; Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand; Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Venue:
AIME'05 Proceedings of the 10th conference on Artificial Intelligence in Medicine
Year:
2005

Abstract

Accuracy is one of the most important issues in identifying biological sequences; however, because biological data are growing exponentially and are highly diverse, the need to compute within a reasonable time often compromises accuracy. We propose novel approaches for accurately identifying DNA sequences in less time by discovering sequence patterns, called signatures, which carry enough distinctive information to identify a sequence. The approaches search for the best combination of n-gram patterns and six statistical scoring algorithms that are regularly used in Information Retrieval research, and then use the resulting signatures to build a similarity scoring model for identifying the DNA. We present two approaches to signature discovery. The first uses only statistical information extracted directly from the sequences. The second also uses prior knowledge of the DNA in the signature discovery process. From our experiments on influenza virus, we found that: 1) our technique identifies the influenza virus with an accuracy of up to 99.69% when 11-grams are used and prior knowledge is applied; 2) signatures that are too short or too long give lower efficiency; and 3) most scoring algorithms perform well for identification, except the “Rocchio algorithm”, whose results are approximately 9% lower than the others. Moreover, this technique can be applied to identifying other organisms.
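
The general pipeline described in the abstract (extract n-grams, score them, keep the best as signatures, then score a query by similarity) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration only: the abstract does not define its six scoring algorithms, so the weight below is a simple hypothetical difference of document-frequency ratios, the similarity is a plain fraction of matched signatures, the function names are invented, and the toy sequences are made up rather than real influenza data.

```python
from collections import Counter


def ngrams(seq, n):
    """Return the set of distinct n-grams (substrings of length n) in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def discover_signatures(target_seqs, other_seqs, n, top_k):
    """Rank n-grams seen in the target class and keep the top_k.

    Hypothetical weight (not one of the paper's scoring algorithms):
    share of target sequences containing the n-gram minus the share
    of non-target sequences containing it.
    """
    target_df = Counter(g for s in target_seqs for g in ngrams(s, n))
    other_df = Counter(g for s in other_seqs for g in ngrams(s, n))

    def weight(g):
        return target_df[g] / len(target_seqs) - other_df[g] / len(other_seqs)

    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(target_df, key=lambda g: (-weight(g), g))
    return ranked[:top_k]


def similarity(query, signatures, n):
    """Fraction of the signatures that occur in the query sequence."""
    if not signatures:
        return 0.0
    grams = ngrams(query, n)
    return sum(1 for g in signatures if g in grams) / len(signatures)


# Toy, made-up sequences (not real viral data).
target = ["ATGGCGTACG", "TTATGGCGTA", "ATGGCGTTTT"]
other = ["CCCCGGGGAA", "GGGAACCCTT"]

sigs = discover_signatures(target, other, n=4, top_k=4)
print(sigs)
print(similarity("GGATGGCGTC", sigs, 4))  # query sharing the common motif
print(similarity("CCCCGGGGAA", sigs, 4))  # query from the other class
```

In this sketch, `n` and `top_k` play the role of the signature length and signature count that would have to be tuned; the abstract reports that signatures that are too short or too long reduce efficiency, and that 11-grams gave the best accuracy in the authors' experiments.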