Structural analysis of regulatory DNA sequences using grammar inference and Support Vector Machine

  • Authors: Robertas Damaševičius
  • Affiliations: Software Engineering Department, Kaunas University of Technology, Student 50-415, LT-51368, Kaunas, Lithuania
  • Venue: Neurocomputing
  • Year: 2010

Abstract

Regulatory DNA sequences such as promoters or splice sites control gene expression and are important for successful gene prediction. Such sequences can be recognized by certain patterns or motifs that are conserved within a species. These patterns have many exceptions, which makes the structural analysis of regulatory sequences a complex problem. Grammar rules can be used to describe the structure of regulatory sequences; however, deriving such rules manually is not trivial. In this paper, stochastic L-grammar rules are derived automatically from positive examples and counterexamples of regulatory sequences using genetic programming techniques. The fitness of the grammar rules is evaluated using a Support Vector Machine (SVM) classifier. The SVM is trained on known sequences to obtain a discriminating function, which is then used to evaluate a candidate grammar rule set: its fitness is the percentage of sequences generated by the rules that are classified correctly. Combining an SVM with grammar rule inference can mitigate the lack of structural insight in machine learning approaches such as SVM.
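
The fitness evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the rule sets, the nonterminal symbols `S`, `B`, and `N`, the function names `expand` and `fitness`, and all parameter values are assumptions made for illustration. In particular, `toy_classifier` is a simple motif check that stands in for the discriminating function of a trained SVM, and the genetic programming search that would propose candidate rule sets is left out.

```python
import random
from typing import Callable, Dict, List, Tuple

# A stochastic L-grammar: each nonterminal maps to weighted productions.
# The representation is a hypothetical choice, not taken from the paper.
Rules = Dict[str, List[Tuple[float, str]]]


def expand(axiom: str, rules: Rules, steps: int, rng: random.Random) -> str:
    """Rewrite all symbols in parallel (L-system style) for `steps` rounds."""
    seq = axiom
    for _ in range(steps):
        out = []
        for sym in seq:
            options = rules.get(sym)
            if options is None:
                out.append(sym)  # symbols without a rule (A, C, G, T) are copied
            else:
                weights = [p for p, _ in options]
                bodies = [b for _, b in options]
                out.append(rng.choices(bodies, weights=weights, k=1)[0])
        seq = "".join(out)
    return seq


def fitness(rules: Rules, classify: Callable[[str], bool], axiom: str = "S",
            steps: int = 2, samples: int = 200, seed: int = 0) -> float:
    """Fraction of generated sequences that the classifier accepts."""
    rng = random.Random(seed)
    hits = sum(classify(expand(axiom, rules, steps, rng)) for _ in range(samples))
    return hits / samples


def toy_classifier(seq: str) -> bool:
    # Stand-in for a trained SVM decision function (assumption):
    # accept any sequence containing a TATA-box-like motif.
    return "TATAAT" in seq


# Uniform choice over the four nucleotides.
ANY_BASE = [(0.25, "A"), (0.25, "C"), (0.25, "G"), (0.25, "T")]

# Candidate that always embeds the motif between random flanking bases.
WITH_MOTIF: Rules = {"S": [(1.0, "NNBNN")], "B": [(1.0, "TATAAT")], "N": ANY_BASE}

# Candidate that only emits four random bases, too short to hold the motif.
WITHOUT_MOTIF: Rules = {"S": [(1.0, "NNNN")], "N": ANY_BASE}

if __name__ == "__main__":
    print(fitness(WITH_MOTIF, toy_classifier))     # 1.0
    print(fitness(WITHOUT_MOTIF, toy_classifier))  # 0.0
```

In a full pipeline, a genetic programming loop would mutate and recombine rule sets and keep those with higher `fitness` values, and `classify` would wrap a classifier trained on known positive and negative sequences.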