Overlap-Based Similarity Metrics for Motif Search in DNA Sequences

Authors:
Hai Thanh Do;Dianhui Wang
Affiliations:
Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia 3086;Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Australia 3086
Venue:
ICONIP '09 Proceedings of the 16th International Conference on Neural Information Processing: Part II
Year:
2009

Abstract

Motifs are collections of transcription factor binding sites (TFBSs) located in the promoter regions of genes. Discovering motifs is critical to understanding the mechanisms of gene regulation, and computational approaches to this challenging problem have shown good potential. However, existing motif search approaches cope poorly with the marked under-representation of binding sites in biological datasets, which results in a considerably high false-positive rate in prediction. We treat the task as an imbalanced biological data classification problem. Our technical contributions in this paper are: (i) a novel similarity metric for comparing DNA subsequences, based on the overlap range of nucleotides in DNA sequences; and (ii) a new sampling method that combines over- and under-sampling techniques. The effectiveness of the proposed similarity metric and sampling approach is demonstrated on two benchmark datasets with three classification techniques: Neural Networks (NN), Support Vector Machine (SVM), and Learning Vector Quantization (LVQ1). Empirical studies show that the LVQ1 classifier integrated with the proposed similarity metric performs slightly better than the other approaches on the test datasets.
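
The abstract does not say how over- and under-sampling are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes plain random duplication of the minority class (binding sites) and random discarding from the majority class (background subsequences). The function name `rebalance` and the parameter `target_per_class` are hypothetical and do not come from the paper.

```python
import random


def rebalance(positives, negatives, target_per_class, seed=0):
    """Illustrative combined over-/under-sampling (an assumption, not the paper's method).

    positives: minority-class samples, e.g. known binding-site subsequences
    negatives: majority-class samples, e.g. background subsequences
    target_per_class: hypothetical size each class is brought towards
    """
    if not positives:
        raise ValueError("need at least one positive sample to over-sample")

    rng = random.Random(seed)

    # Over-sampling: keep every positive, then duplicate randomly chosen
    # positives until the class reaches the target size.
    extra = max(0, target_per_class - len(positives))
    over = list(positives) + [rng.choice(positives) for _ in range(extra)]

    # Under-sampling: draw a random subset of negatives without replacement.
    keep = min(target_per_class, len(negatives))
    under = rng.sample(list(negatives), keep)

    return over, under
```

With 2 binding sites, 10 background subsequences, and `target_per_class=5`, this returns 5 positives (the 2 originals plus 3 duplicates) and 5 distinct negatives. Random duplication adds no new information, which is why more elaborate schemes such as SMOTE synthesize new minority samples instead.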