Overlap-Based Similarity Metrics for Motif Search in DNA Sequences

  • Authors:
  • Hai Thanh Do; Dianhui Wang

  • Affiliations:
  • Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia 3086;Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia 3086

  • Venue:
  • ICONIP '09 Proceedings of the 16th International Conference on Neural Information Processing: Part II
  • Year:
  • 2009

Abstract

Motifs are collections of transcription factor binding sites (TFBSs) located in the promoter regions of genes. Discovering motifs is critical to understanding the mechanisms of gene regulation, and computational approaches to this challenging problem have shown good potential. However, existing motif search approaches cope poorly with the marked under-representation of binding sites in biological datasets, which leads to a considerably high false-positive rate in prediction. We treat the task as an imbalanced biological data classification problem. Our technical contributions are twofold: (i) we propose novel similarity metrics for comparing DNA subsequences, based on the overlap range of nucleotides in DNA sequences; and (ii) we introduce a new sampling method that combines over- and under-sampling techniques. The effectiveness of the proposed similarity metrics and sampling approach is demonstrated on two benchmark datasets with three classification techniques: Neural Networks (NN), Support Vector Machine (SVM), and Learning Vector Quantization (LVQ1). Empirical studies show that the LVQ1 classifier integrated with the proposed similarity metrics performs slightly better than the other approaches on the test datasets.
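
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of the general idea of combining over- and under-sampling on an imbalanced dataset. It uses plain random sampling, which is an assumption and not necessarily the authors' method. The function name `rebalance`, the parameter `target_per_class`, and the example sequences are hypothetical.

```python
import random


def rebalance(positives, negatives, target_per_class, seed=0):
    """Generic random over-/under-sampling sketch (not the paper's method).

    positives: minority class, e.g. known binding-site subsequences
    negatives: majority class, e.g. background subsequences
    target_per_class: hypothetical size each class is brought to
    """
    rng = random.Random(seed)

    # Over-sampling: keep every minority example, then draw extra
    # copies with replacement until the target size is reached.
    extra = max(0, target_per_class - len(positives))
    over = list(positives) + [rng.choice(positives) for _ in range(extra)]

    # Under-sampling: draw a subset of the majority class
    # without replacement.
    keep = min(target_per_class, len(negatives))
    under = rng.sample(list(negatives), keep)

    return over, under


# Illustrative data only: three made-up binding sites against
# thirty random 7-mers standing in for background sequence.
rng = random.Random(1)
sites = ["TGACTCA", "TGAGTCA", "TGACTAA"]
background = ["".join(rng.choice("ACGT") for _ in range(7)) for _ in range(30)]

balanced_sites, balanced_background = rebalance(sites, background, 10)
print(len(balanced_sites), len(balanced_background))
```

Bringing both classes to a common size from opposite directions avoids discarding most of the background data while limiting how often each rare binding site is duplicated. The actual balance point and sampling strategy used in the paper may differ.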