Intelligent data recognition of DNA sequences using statistical models

  • Authors:
  • Jitimon Keinduangjun;Punpiti Piamsa-nga;Yong Poovorawan

  • Affiliations:
  • Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Computer Engineering, Faculty of Engineering, Kasetsart University, Bangkok, Thailand;Department of Pediatrics, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand

  • Venue:
  • PReMI'05: Proceedings of the First International Conference on Pattern Recognition and Machine Intelligence
  • Year:
  • 2005

Abstract

Intelligent data acquisition from biological sequences is a hard and challenging problem, since most biological sequences contain poorly understood, diverse, and very large volumes of data. However, intelligent data acquisition reduces the demand for computationally expensive methods, because the resulting data are more compact and more precise. We propose a novel approach for discovering sequence signatures, which are pieces of information distinctive enough to identify the sequences. The signatures are derived from the best combination of n-grams and statistical scoring models. In our experiments applying them to identify the Influenza virus, we found that identifiers built from n-gram signatures that are too short, or from inappropriate scoring models, perform poorly, because such combinations produce unbalanced class and pattern score distributions. In contrast, identifiers that use an appropriate combination achieve accuracy above 80% and up to 100%. Besides succeeding at signature recognition, the proposed approach also requires low computation time for biological sequence identification.
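The general idea of pairing n-grams with a statistical scoring model can be illustrated with a minimal sketch. The abstract does not say which scoring models the authors evaluated, so the smoothed log-odds score used here is an assumption chosen only for illustration, and all function names, the pseudocount, and the hit-fraction threshold are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return all overlapping n-grams of a sequence, in order."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def ngram_counts(seqs, n):
    """Count n-grams over a collection of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(ngrams(s, n))
    return counts, sum(counts.values())


def log_odds_scores(target_seqs, background_seqs, n, pseudocount=1.0):
    """Score each n-gram by a smoothed log-odds of target vs. background.

    Assumption: this scoring model is illustrative only; it is not
    claimed to be one of the models evaluated in the paper.
    """
    t_counts, t_total = ngram_counts(target_seqs, n)
    b_counts, b_total = ngram_counts(background_seqs, n)
    vocab = set(t_counts) | set(b_counts)
    v = len(vocab)
    scores = {}
    for g in vocab:
        p_t = (t_counts[g] + pseudocount) / (t_total + pseudocount * v)
        p_b = (b_counts[g] + pseudocount) / (b_total + pseudocount * v)
        scores[g] = math.log(p_t / p_b)
    return scores


def select_signatures(scores, k):
    """Keep the k highest-scoring n-grams as candidate signatures."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def identify(seq, signatures, n, threshold):
    """Label a sequence as the target class if enough signatures occur in it.

    The hit-fraction threshold is a hypothetical decision rule.
    """
    if not signatures:
        return False
    grams = set(ngrams(seq, n))
    hits = sum(1 for s in signatures if s in grams)
    return hits / len(signatures) >= threshold


# Toy data, not real viral sequences.
target = ["ACGTACGT", "ACGTTACG"]
background = ["GGGGCCCC", "CCCCGGGG"]
scores = log_odds_scores(target, background, n=3)
signatures = select_signatures(scores, k=1)
```

In this sketch, shorter n-grams occur in nearly every sequence and so score close to zero, which loosely mirrors the abstract's remark that signatures that are too short identify sequences poorly.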