A linguistic approach to classification of bacterial genomes

  • Authors:
  • Zeev Volkovich; Valery Kirzhner; Zeev Barzily; Sergey Hosid; Katerina Korenblat

  • Affiliations:
  • Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel; Institute of Evolution, University of Haifa, Haifa 31905, Israel; Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel; Institute of Evolution, University of Haifa, Haifa 31905, Israel; Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

  • Venue:
  • Pattern Recognition
  • Year:
  • 2010

Abstract

In this paper, 188 prokaryote genomes are classified by calculating compositional spectra separately for the coding and the non-coding parts of each genome. For each subsequence, the compositional spectrum is mapped to a corresponding point in a vector space, which allows the genomes to be grouped into meaningful clusters by a formal method. Repeating the clustering for the coding and the non-coding genome parts makes it possible to estimate the true number of genome clusters. The proposed method is based on a new application of external cluster validation indices and on the misclassification quantities obtained in the process of repeated clustering. In addition, we construct a further embedding of the data into an appropriate Euclidean space, based only on the distances between compositional spectra. Biological evaluation of the results obtained for the 4-letter and the 2-letter alphabets supports the appropriateness of the resulting cluster-based classification.
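
To make the pipeline in the abstract more concrete, the following is a minimal Python sketch of its main ingredients, not the authors' implementation. It assumes a simplified compositional spectrum (relative frequencies of exact k-mers, whereas the paper's definition may differ), a purine/pyrimidine reduction as one possible 2-letter alphabet (the abstract does not say which reduction was used), and the Rand index as one example of an external cluster validation index. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product
import math

# Assumption: one possible 2-letter alphabet, purine (R) vs pyrimidine (Y).
# The abstract does not specify which 2-letter reduction the authors used.
PURINE_PYRIMIDINE = str.maketrans("ACGT", "RYRY")


def kmer_spectrum(seq, k, alphabet="ACGT"):
    """Simplified spectrum: relative frequency of every exact k-mer over
    `alphabet`, listed in a fixed lexicographic word order."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean(u, v):
    """Euclidean distance between two spectra viewed as points in a vector space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rand_index(labels_a, labels_b):
    """Rand index: fraction of item pairs on which two partitions agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    # Toy sequences standing in for the coding parts of two genomes.
    g1, g2 = "ACGTACGTAACC", "GGGTTTACGTGG"
    s1, s2 = kmer_spectrum(g1, 2), kmer_spectrum(g2, 2)
    print("4-letter distance:", euclidean(s1, s2))

    r1 = kmer_spectrum(g1.translate(PURINE_PYRIMIDINE), 2, alphabet="RY")
    r2 = kmer_spectrum(g2.translate(PURINE_PYRIMIDINE), 2, alphabet="RY")
    print("2-letter distance:", euclidean(r1, r2))

    # Agreement between two hypothetical partitions of four genomes, e.g. one
    # from coding parts and one from non-coding parts.
    print("Rand index:", rand_index([0, 0, 1, 1], [0, 0, 1, 0]))
```

In this sketch, high agreement between the partitions obtained from the coding and the non-coding parts for a given number of clusters would be read as evidence that this number is plausible. The actual selection rule used in the paper is not stated in the abstract.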