Novel approaches for small biomolecule classification and structural similarity search

Authors:
Emre Karakoc;Artem Cherkasov;S. Cenk Sahinalp
Affiliations:
SFU Lab for Computational Biology, Burnaby, Canada;UBC Division of Infectious Diseases, Vancouver, Canada;SFU Lab for Computational Biology, Burnaby, Canada
Venue:
ACM SIGKDD Explorations Newsletter - Special issue on data mining for health informatics
Year:
2007

Abstract

Structural similarity search among small molecules is a standard tool in molecular classification and in silico drug discovery. The effectiveness of this general approach depends on how well two problems are addressed. First, the notion of similarity should be chosen to provide the highest level of discrimination among compounds with respect to the bioactivity of interest. Second, the data structure used for search should be very efficient, because the molecular databases of interest include several million compounds. In this paper we summarize recent applications of the k-nearest-neighbor (k-nn) search method to small molecule classification. The k-nn classification of small molecules is based on selecting the most relevant set of chemical descriptors, which are then compared under a standard Minkowski distance Lp. Here we describe how to computationally design the optimal weighted Minkowski distance wLp that maximizes the discrimination between active and inactive compounds with respect to the bioactivities of interest. Because k-nn classification requires fast similarity search to predict the bioactivity of a new molecule, we then focus on the construction of pruning-based k-nn search data structures, for any wLp distance, that minimize similarity search time. The accuracy achieved by the k-nn classifier is better than that of the alternative linear discriminant analysis (LDA) and multiple linear regression (MLR) approaches and is comparable to that of artificial neural network (ANN) methods. In terms of running time, the k-nn classifier is considerably faster than the ANN approach, especially when large data sets are used. Furthermore, the k-nn classifier can quantify the level of bioactivity rather than returning a binary decision, and it can bring more insight into the nature of the activity by eliminating descriptors of the compounds that are unrelated to the activity in question.
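
As a rough illustration of the k-nn classification described above, the sketch below assumes the weighted Minkowski distance takes the common form wLp(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p); the abstract does not spell out the exact definition. The function names, the toy descriptor vectors, and the hand-picked weights are all hypothetical. The search is a plain linear scan, not the pruning-based data structure the paper builds, and the weights are set manually rather than optimized.

```python
from collections import Counter


def weighted_minkowski(x, y, weights, p=2.0):
    """Assumed form of wLp: (sum_i w_i * |x_i - y_i|^p)^(1/p)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    total = sum(w * abs(a - b) ** p for a, b, w in zip(x, y, weights))
    return total ** (1.0 / p)


def knn_predict(query, compounds, labels, weights, k=3, p=2.0):
    """Brute-force k-nn: return (majority label, fraction of active neighbours)."""
    # Rank all compounds by their weighted distance to the query.
    ranked = sorted(
        range(len(compounds)),
        key=lambda i: weighted_minkowski(query, compounds[i], weights, p),
    )
    nearest = [labels[i] for i in ranked[:k]]
    majority = Counter(nearest).most_common(1)[0][0]
    # The active fraction is a simple graded score, not a binary decision.
    active_fraction = sum(1 for lab in nearest if lab == "active") / len(nearest)
    return majority, active_fraction


# Toy descriptor vectors (hypothetical). The third descriptor is noise
# that is unrelated to the activity.
compounds = [
    [0.0, 0.0, 5.0],
    [0.1, 0.2, -3.0],
    [0.2, 0.1, 9.0],
    [1.0, 1.0, 0.0],
    [0.9, 1.1, 4.0],
    [1.1, 0.9, -7.0],
]
labels = ["inactive", "inactive", "inactive", "active", "active", "active"]
query = [0.95, 1.0, 8.0]

# A zero weight eliminates the unrelated third descriptor.
weights = [1.0, 1.0, 0.0]
print(knn_predict(query, compounds, labels, weights, k=3))  # ('active', 1.0)

# With uniform weights the noisy descriptor dominates the distance.
print(knn_predict(query, compounds, labels, [1.0, 1.0, 1.0], k=3))
```

With the third descriptor weighted to zero, the three nearest neighbours of the query are all active; with uniform weights the noise descriptor pulls in inactive compounds and the majority label flips to inactive. This is the effect the weighting is meant to control, shown here on made-up data only.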