The method of N-grams in large-scale clustering of DNA texts

  • Authors:
  • Z. Volkovich; V. Kirzhner; A. Bolshoy; E. Nevo; A. Korol

  • Affiliations:
  • Department of Software Engineering, ORT Braude College, P.O. Box 78, Karmiel, 20101, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel; Institute of Evolution, University of Haifa, Mount Carmel, Haifa 31905, Israel

  • Venue:
  • Pattern Recognition
  • Year:
  • 2005

Abstract

This paper addresses techniques for clustering texts by comparing their vocabularies of N-grams. In contrast to the regular N-gram approach, the proposed method counts imperfect occurrences of N-grams in a text, allowing up to a given number of mismatches. We demonstrate that this approach substantially improves the resolving power of the N-gram method for DNA texts. Additionally, we discuss a scheme for jointly using different types of clustering techniques to verify the quality of the partition.
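
The idea of counting imperfect occurrences can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the mismatch measure, so the sketch assumes Hamming distance between an N-gram and each length-N window of the text, and the names `hamming`, `fuzzy_ngram_counts`, `max_mismatches`, and `alphabet` are hypothetical.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_ngram_counts(text, n, max_mismatches, alphabet="ACGT"):
    """Count occurrences of every N-gram over `alphabet` in `text`,
    accepting windows that differ in at most `max_mismatches` positions.

    Assumption: "imperfect occurrence" is modelled as Hamming distance;
    the abstract does not define the measure.
    """
    # Exact counts of each length-n window that appears in the text.
    exact = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    counts = {}
    # Enumerate all |alphabet|**n candidate N-grams (feasible for small n only).
    for gram in map("".join, product(alphabet, repeat=n)):
        counts[gram] = sum(
            c for window, c in exact.items()
            if hamming(gram, window) <= max_mismatches
        )
    return counts


if __name__ == "__main__":
    profile = fuzzy_ngram_counts("ACGTAC", n=2, max_mismatches=1)
    print(profile["AC"], profile["AA"])
```

With `max_mismatches=0` the function reduces to the regular exact N-gram count. The resulting dictionaries could serve as the vocabularies compared between texts, although the choice of comparison measure is left open here.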