Divide and Conquer Machine Learning for a Genomics Analogy Problem (Progress Report)

  • Authors:
  • Ming Ouyang; John Case; Joan Burnside

  • Venue:
  • DS '01 Proceedings of the 4th International Conference on Discovery Science
  • Year:
  • 2001

Abstract

Genomic strings are not of fixed length. They provide one-dimensional spatial data that do not divide naturally into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction.

Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. This paper indicates a means of obtaining an associated fixed-size attribute vector for genomic string data, and of dividing and conquering the machine learning problem of ortholog detection, seen here as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors.

C5.0 with AdaBoosting activated was employed. The preliminary results reported here on complete cDNA strings are very encouraging for eventually and usefully employing the described techniques for ortholog detection on the more readily available EST (incomplete) genomic data.
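The idea of mapping a pair of variable-length genomic strings to one fixed-size attribute vector can be illustrated with a minimal sketch. The abstract does not list the paper's actual attributes, so everything below is an assumption made for illustration: the function names (`gc_content`, `kmer_profile`, `pair_attributes`), the use of Python's `difflib` ratio as a stand-in for a string similarity measure, and the three simple "differential" attributes (length, GC content, and k-mer composition differences).

```python
from difflib import SequenceMatcher
from itertools import product


def gc_content(seq):
    """Fraction of G or C bases (0.0 for an empty string)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def kmer_profile(seq, k=2):
    """Relative frequency of every DNA k-mer, always in the same fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-ACGT symbols
            counts[window] += 1
    total = max(n_windows, 1)  # avoid division by zero on very short strings
    return [counts[kmer] / total for kmer in kmers]


def pair_attributes(a, b, k=2):
    """Map two strings of any length to one vector of length 3 + 4**k.

    All attributes here are hypothetical examples, not the paper's own.
    """
    # Stand-in for a bioinformatics string similarity measure.
    similarity = SequenceMatcher(None, a, b, autojunk=False).ratio()
    # Illustrative differential attributes: how the two strings differ.
    length_diff = abs(len(a) - len(b)) / max(len(a), len(b), 1)
    gc_diff = abs(gc_content(a) - gc_content(b))
    profile_diff = [
        abs(x - y) for x, y in zip(kmer_profile(a, k), kmer_profile(b, k))
    ]
    return [similarity, length_diff, gc_diff] + profile_diff
```

Because the vector length depends only on `k` and not on the input lengths, vectors built this way for labeled string pairs could in principle be passed to any boosted decision-tree learner, in the spirit of the C5.0 with AdaBoost setup mentioned in the abstract.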