Building a generic graph-based descriptor set for use in drug discovery

  • Authors:
  • Phillip Lock;Nicolas Le Mercier;Jiuyong Li;Markus Stumptner

  • Affiliations:
  • University of South Australia, Mawson Lakes, South Australia;Ecole Nationale Superieure de Techniques Avancées, Paris, France;University of South Australia, Mawson Lakes, South Australia;University of South Australia, Mawson Lakes, South Australia

  • Venue:
  • AusDM '09 Proceedings of the Eighth Australasian Data Mining Conference - Volume 101
  • Year:
  • 2009

Abstract

Predicting drug activity from molecular structure is an important field of research in both academia and the pharmaceutical industry. Raw 3D structure data is not in a form suitable for identifying properties with machine learning, so it must be reconfigured into descriptor sets that still encapsulate the important structural properties of the molecule. In this study, a large number of small-molecule structures, obtained from publicly available databases, was used to generate a set of molecular descriptors that can be used with machine learning to predict drug activity. The descriptors were for the most part simple graph strings representing chains of connected atoms. With molecules averaging about seventy atoms each and a dataset of just over one million molecules, this produced a very large set of simple graph strings, each two to twelve atoms long. Eliminating duplicates and reverse strings, and then applying feature reduction techniques, reduced the path count to about three thousand, a size viable for machine learning. Training data from twenty-six data sets was used to build decision tree classifiers with J48 and Random Forest. Forty-three thousand molecules from the NCI HIV dataset were used with the descriptor set to generate decision tree models with good accuracy. A similar algorithm was used to extract ring structures in the molecules. Including thirteen ring structure descriptors increased the accuracy of prediction.
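The path-string idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `path_descriptors`, the hyphen-joined string encoding, and the choice to ignore bond orders and hydrogens are all assumptions made for illustration. The sketch enumerates every simple chain of connected atoms within a length range and merges each chain with its reverse by keeping the lexicographically smaller of the two label sequences.

```python
from collections import Counter


def path_descriptors(atoms, bonds, min_len=2, max_len=12):
    """Count simple atom chains in one molecule (hypothetical helper).

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) index pairs; bond order is ignored (assumption).
    Returns a dict mapping a canonical path string to its occurrence count.
    """
    # Build an undirected adjacency list from the bond list.
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    counts = Counter()

    def extend(path):
        if len(path) >= min_len:
            labels = tuple(atoms[i] for i in path)
            # A chain and its reverse describe the same substructure,
            # so keep only the smaller of the two label sequences.
            counts[min(labels, labels[::-1])] += 1
        if len(path) == max_len:
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:  # simple paths only, no revisits
                extend(path + [neighbour])

    for start in range(len(atoms)):
        extend([start])

    # Every chain was walked once from each end, so halve the counts.
    return {"-".join(labels): n // 2 for labels, n in counts.items()}


# Heavy-atom skeleton of ethanol: C0-C1-O2.
ethanol = path_descriptors(["C", "C", "O"], [(0, 1), (1, 2)])
print(ethanol)
```

Run over a whole dataset, the union of the returned keys would form the raw descriptor vocabulary; the abstract's feature reduction step, which this sketch does not cover, would then prune that vocabulary to a size suitable for training.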