Class-based n-gram models of natural language
Computational Linguistics
A document retrieval model based on term frequency ranks
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Feature Selection for Unbalanced Class Distribution and Naive Bayes
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Hi-index | 0.00 |
The intelligent data acquisition in biological sequences is a hard and challenge problem since most biological sequences contain unknowledgeable, diverse and huge data. However, the intelligent data acquisition reduces a demand on the use of high computation methods because the data are more compact and more precise. We propose a novel approach for discovering sequence signatures, which are sufficiently distinctive information in identifying the sequences. The signatures are derived from the best combination of the n-grams and the statistical scoring models. From our experiments in applying them to identify the Influenza virus, we found that the identifiers constructed by too short n-gram signatures and inappropriate scoring models get low efficiency since the inappropriate combinations of n-gram signatures and scoring models bring about unbalanced class and pattern score distribution. However, the other identifiers provide accuracy over 80% and up to 100%, when they apply an appropriate combination. In addition to accomplishing in the signature recognition, our proposed approach also requires low computation time for the biological sequence identification.