Self-Organizing Approach for Automated Gene Identification

  • Authors:
  • Andrey Yu. Zinovyev;Alexander N. Gorban;Tatyana G. Popova

  • Affiliations:
  • Institut des Hautes Études Scientifiques, France, e-mail: zinovyev@ihes.fr;Institute of Computational Modeling of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036 Russia, e-mail: gorban@icm.krasn.ru;Institute of Computational Modeling of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036 Russia, e-mail: tanya@icm.krasn.ru

  • Venue:
  • Open Systems & Information Dynamics
  • Year:
  • 2003

Abstract

A self-training technique for automated gene recognition, both in entire genomes and in unassembled ones, is proposed. It is based on a simple measure (namely, the vector of frequencies of non-overlapping triplets in a sliding window) and needs neither predetermined information nor preliminary learning. The sliding window length is the only tuning parameter. It should be chosen close to the average exon length typical of the DNA text under investigation. An essential feature of the proposed technique is preliminary visualization of the set of vectors in the subspace of the first three principal components. It was shown that the distribution of DNA sites has a bullet-like structure with one central cluster (corresponding to non-coding sites) and three or six flank clusters (corresponding to protein-coding sites). The bullet-like structure revealed in the distribution is itself an interesting illustration of triplet usage in DNA sequences. The method was examined on several genomes (the mitochondrion of P. wickerhamii, the bacterium C. crescentus and the primitive eukaryote S. cerevisiae). The percentage of correctly predicted nucleotides exceeds 90%.
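
The measure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window step, the handling of ambiguous symbols, and the use of an SVD for the principal components are all assumptions made for illustration, and the clustering stage that would follow is omitted.

```python
from itertools import product

import numpy as np

# All 64 triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Frequencies of non-overlapping triplets, read from the fragment start."""
    counts = np.zeros(64)
    for i in range(0, len(fragment) - 2, 3):
        j = INDEX.get(fragment[i:i + 3])
        if j is not None:  # assumption: skip triplets with symbols such as N
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts


def window_vectors(seq, window, step):
    """One 64-dimensional frequency vector per sliding-window position.

    `step` is a hypothetical parameter; the abstract names only the
    window length as a tuning parameter.
    """
    seq = seq.upper()
    starts = range(0, len(seq) - window + 1, step)
    return np.array([triplet_frequencies(seq[s:s + window]) for s in starts])


def first_principal_components(X, k=3):
    """Project the centered rows of X onto the first k principal components."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


if __name__ == "__main__":
    toy = "ATG" * 10  # toy periodic sequence, not real genomic data
    X = window_vectors(toy, window=9, step=1)
    Y = first_principal_components(X)
    print(X.shape, Y.shape)
```

In the toy sequence, shifting the window by one nucleotide changes which triplets are counted (ATG, then TGA, then GAT), which hints at why the position of a window relative to the reading frame matters for this measure.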