Novel approaches for small biomolecule classification and structural similarity search

  • Authors:
  • Emre Karakoc; Artem Cherkasov; S. Cenk Sahinalp

  • Affiliations:
  • SFU Lab for Computational Biology, Burnaby, Canada; UBC Division of Infectious Diseases, Vancouver, Canada; SFU Lab for Computational Biology, Burnaby, Canada

  • Venue:
  • ACM SIGKDD Explorations Newsletter - Special issue on data mining for health informatics
  • Year:
  • 2007

Abstract

Structural similarity search among small molecules is a standard tool in molecular classification and in-silico drug discovery. The effectiveness of this general approach depends on how well two problems are addressed. First, the notion of similarity should be chosen to provide the highest level of discrimination among compounds with respect to the bioactivity of interest. Second, the data structure used for search should be very efficient, because the molecular databases of interest contain several million compounds. In this paper we summarize recent applications of the k-nearest-neighbor (k-nn) search method to small molecule classification. The k-nn classification of small molecules is based on selecting the most relevant set of chemical descriptors, which are then compared under the standard Minkowski distance Lp. Here we describe how to computationally design the optimal weighted Minkowski distance wLp for maximizing the discrimination between active and inactive compounds with respect to the bioactivities of interest. Because k-nn classification requires fast similarity search to predict the bioactivity of a new molecule, we then focus on the construction of pruning-based k-nn search data structures, for any wLp distance, that minimize similarity search time. The accuracy achieved by the k-nn classifier is better than that of the alternative LDA and MLR approaches and is comparable to that of the ANN methods. In terms of running time, the k-nn classifier is considerably faster than the ANN approach, especially when large data sets are used. Furthermore, the k-nn classifier can quantify the level of bioactivity rather than return a binary decision, and it can bring more insight into the nature of the activity by eliminating descriptors of the compounds that are unrelated to the activity in question.
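The idea of k-nn classification under a weighted Minkowski distance can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the common form wLp(x, y) = (sum_i w_i |x_i - y_i|^p)^(1/p), takes the weights as given instead of optimizing them, and uses a brute-force scan in place of the pruning-based search structure the abstract describes. The function names, the descriptor vectors, and the labels are all hypothetical.

```python
from typing import Sequence


def weighted_minkowski(x: Sequence[float], y: Sequence[float],
                       w: Sequence[float], p: float = 2.0) -> float:
    """Assumed wLp form: (sum_i w_i * |x_i - y_i|^p)^(1/p)."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    return sum(wi * abs(xi - yi) ** p
               for xi, yi, wi in zip(x, y, w)) ** (1.0 / p)


def knn_activity_score(query: Sequence[float],
                       compounds: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       w: Sequence[float],
                       k: int = 3, p: float = 2.0) -> float:
    """Fraction of active compounds (label 1) among the k nearest neighbors.

    Brute-force scan over all compounds; a graded score rather than
    a binary decision.
    """
    ranked = sorted(range(len(compounds)),
                    key=lambda i: weighted_minkowski(query, compounds[i], w, p))
    nearest = ranked[:k]
    return sum(labels[i] for i in nearest) / len(nearest)


# Hypothetical toy data: two descriptors per compound, where only the
# first descriptor is related to the activity.
compounds = [(1.0, 9.0), (1.2, 0.0), (0.9, 5.0),   # active
             (4.0, 1.0), (4.2, 1.1), (3.9, 0.9)]   # inactive
labels = [1, 1, 1, 0, 0, 0]
query = (1.1, 1.0)

uniform = knn_activity_score(query, compounds, labels, w=(1.0, 1.0))
weighted = knn_activity_score(query, compounds, labels, w=(1.0, 0.0))
print(uniform, weighted)
```

With uniform weights, the unrelated second descriptor pulls two inactive compounds into the three nearest neighbors of the query. Setting its weight to zero leaves only active compounds among the neighbors, which is the effect of eliminating unrelated descriptors in this toy setting.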