GeneScout: a data mining system for predicting vertebrate genes in genomic DNA sequences

  • Authors:
  • Michael M. Yin;Jason T. L. Wang

  • Affiliations:
  • Department of Computer Science, New Jersey Institute of Technology, University Heights, Newark, NJ;Department of Computer Science, New Jersey Institute of Technology, University Heights, Newark, NJ

  • Venue:
  • Information Sciences: an International Journal - Special issue: Soft computing data mining
  • Year:
  • 2004

Abstract

Automated detection or prediction of coding sequences within genomic DNA has been a major rate-limiting step in the pursuit of vertebrate genes. Programs currently available are far from being powerful enough to elucidate a gene structure completely. In this paper, we present a new system, called GeneScout, for predicting gene structures in vertebrate genomic DNA. The system contains specially designed hidden Markov models (HMMs) for detecting functional sites, including protein-translation start sites and mRNA splicing junction donor and acceptor sites. An HMM is also proposed for computing exon coding potential. Our main hypothesis is that, given a vertebrate genomic DNA sequence S, it is always possible to construct a directed acyclic graph G such that the path for the actual coding region of S is in the set of all paths on G. Thus, the gene detection problem is reduced to that of analyzing the paths in the graph G. A dynamic programming algorithm is used to find the optimal path in G. The proposed system is trained using an expectation-maximization algorithm, and its performance on vertebrate gene prediction is evaluated using the 10-way cross-validation method. Experimental results show that the proposed system performs well and is comparable to existing gene discovery tools.
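
The abstract does not spell out how the graph G is built or scored, so the following is only a minimal sketch of the general technique it names: dynamic programming for the highest-scoring path in a directed acyclic graph. The function name `best_path`, the integer node labels, the edge scores, and the convention that node numbering is already a topological order are all assumptions made for illustration; they are not taken from GeneScout. In a gene-finding setting one might imagine nodes as candidate functional sites ordered by sequence position and edge scores as coding potentials, but that mapping is hypothetical here.

```python
from collections import defaultdict


def best_path(num_nodes, edges, source, sink):
    """Return (score, path) for the highest-scoring source-to-sink path.

    Assumption (not from the paper): nodes are integers 0..num_nodes-1 and
    every edge (u, v, score) satisfies u < v, so increasing node index is a
    valid topological order of the DAG.
    """
    neg_inf = float("-inf")
    outgoing = defaultdict(list)
    for u, v, score in edges:
        if not u < v:
            raise ValueError("edges must go from a lower to a higher node index")
        outgoing[u].append((v, score))

    best = [neg_inf] * num_nodes   # best[v]: top score of any source-to-v path
    prev = [None] * num_nodes      # prev[v]: predecessor of v on that path
    best[source] = 0.0

    # Relax edges in topological order; each edge is examined exactly once.
    for u in range(num_nodes):
        if best[u] == neg_inf:
            continue  # u is not reachable from the source
        for v, score in outgoing[u]:
            if best[u] + score > best[v]:
                best[v] = best[u] + score
                prev[v] = u

    if best[sink] == neg_inf:
        return neg_inf, []

    # Walk the predecessor links back from the sink to recover the path.
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = prev[node]
    path.reverse()
    return best[sink], path


# Hypothetical toy graph with made-up scores.
toy_edges = [
    (0, 1, 2.0), (0, 2, 1.0), (1, 3, 1.0),
    (2, 3, 3.5), (3, 4, 0.5), (1, 4, 2.0),
]
score, path = best_path(5, toy_edges, source=0, sink=4)
print(score, path)
```

Because each node is processed once in topological order and each edge is relaxed once, the running time is linear in the number of nodes plus edges, which is what makes a path-based formulation tractable even when the number of distinct paths is very large.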