Distance based algorithms for small biomolecule classification and structural similarity search

Authors:
Emre Karakoc;Artem Cherkasov;S. Cenk Sahinalp
Affiliations:
-;-;-
Venue:
Bioinformatics
Year:
2006

Abstract

Motivation: Structural similarity search among small molecules is a standard tool in molecular classification and in-silico drug discovery. The effectiveness of this approach depends on how well two problems are addressed: (i) the notion of similarity should be chosen to provide the highest level of discrimination between compounds with respect to the bioactivity of interest; (ii) the data structure used for search should be very efficient, since the molecular databases of interest contain several million compounds.

Results: In this paper we focus on the k-nearest-neighbor (k-nn) search method, which, until recently, was not considered for small molecule classification. The few recent applications of k-nn to compound classification focus on selecting the most relevant set of chemical descriptors, which are then compared under the standard Minkowski distance Lp. Here we show how to computationally design the optimal weighted Minkowski distance wLp for maximizing the discrimination between active and inactive compounds with respect to the bioactivities of interest. We then show how to construct pruning-based k-nn search data structures, for any wLp distance, that minimize similarity search time. The accuracy achieved by our classifier is better than that of the alternative LDA and MLR approaches and is comparable to that of the ANN methods. In terms of running time, our classifier is considerably faster than the ANN approach, especially on large data sets. Furthermore, our classifier quantifies the level of bioactivity rather than returning a binary decision, and is thus more informative than the ANN approach.

Contact: cenk@cs.sfu.ca
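
The k-nn classification under a weighted Minkowski distance described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the common form wLp(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p), uses hand-picked weights instead of the optimized weights the paper derives, and replaces the pruning-based search structure with a brute-force scan. The function names, the toy descriptors, and the labels are all hypothetical.

```python
def weighted_minkowski(x, y, w, p=2.0):
    """Assumed wLp form: (sum_i w_i * |x_i - y_i|**p) ** (1/p)."""
    total = sum(wi * abs(xi - yi) ** p for xi, yi, wi in zip(x, y, w))
    return total ** (1.0 / p)


def knn_activity_score(query, descriptors, labels, w, k=3, p=2.0):
    """Return the fraction of the k nearest compounds that are active.

    Brute-force scan over all compounds; a pruning-based index would
    avoid computing every distance. Labels are 1 (active) or 0 (inactive).
    """
    ranked = sorted(
        range(len(descriptors)),
        key=lambda i: weighted_minkowski(query, descriptors[i], w, p),
    )
    nearest = ranked[:k]
    return sum(labels[i] for i in nearest) / len(nearest)


# Hypothetical toy data: two descriptors per compound.
descriptors = [[0.0, 0.0], [1.0, 0.0], [0.0, 5.0], [4.0, 4.0]]
labels = [1, 1, 0, 0]
weights = [1.0, 0.5]  # illustrative values, not optimized

score = knn_activity_score([0.5, 0.0], descriptors, labels, weights, k=3)
print(score)
```

Returning the fraction of active neighbors, rather than a majority vote, mirrors the abstract's point that the classifier quantifies a level of bioactivity instead of giving a binary decision.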