Identification of lead compounds in pharmaceutical data using data mining techniques

  • Authors:
  • Christodoulos A. Nicolaou

  • Affiliations:
  • Bioreason, Inc., Santa Fe, NM

  • Venue:
  • PCI'01 Proceedings of the 8th Panhellenic conference on Informatics
  • Year:
  • 2001

Abstract

As the use of High-Throughput Screening (HTS) systems becomes more routine in the drug discovery process, there is an increasing need for fast and reliable analysis of the massive amounts of resulting biological data. At the forefront of the methods used for analyzing HTS data is cluster analysis. It is used in this context to find natural groups in the data, thereby revealing families of compounds that exhibit increased activity towards a specific biological target. Scientists in this area have traditionally used a number of clustering algorithms, distance (similarity) measures, and compound representation methods. We first discuss the nature of chemical and biological data and how it adversely affects the current analysis methodology. We emphasize the inability of widely used methods to discover the chemical families in a pharmaceutical dataset and point out specific problems that occur when one attempts to apply these common clustering and other statistical methods to chemical data. We then introduce a new data-mining algorithm that employs a newly proposed clustering method and expert knowledge to accommodate user requests and produce chemically sensible results. This new, chemically aware algorithm uses molecular structure to find true chemical structural families of compounds in pharmaceutical data, while at the same time accommodating the multi-domain nature of chemical compounds.
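
To make the conventional setup concrete, the following is a minimal sketch of the kind of similarity-based clustering the abstract refers to. It is not the algorithm proposed in the paper. It assumes that each compound is represented as a set of "on" bits of a binary structural fingerprint, that similarity is the Tanimoto coefficient, and that compounds are grouped by simple threshold linking. The compound names, bit values, threshold, and function names are all hypothetical.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def threshold_clusters(fingerprints: dict, threshold: float) -> list:
    """Link every pair with similarity >= threshold and return the
    connected components (a simple single-linkage style grouping)."""
    parent = {name: name for name in fingerprints}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in combinations(fingerprints, 2):
        if tanimoto(fingerprints[u], fingerprints[v]) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for name in fingerprints:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical toy fingerprints (illustrative bit positions, not real chemistry).
fps = {
    "cmpd_A": frozenset({1, 2, 3, 4}),
    "cmpd_B": frozenset({1, 2, 3, 5}),
    "cmpd_C": frozenset({10, 11, 12}),
    "cmpd_D": frozenset({10, 11, 13}),
}
clusters = threshold_clusters(fps, threshold=0.5)
print(clusters)
```

A grouping of this kind places each compound in exactly one cluster, so it cannot express a compound that belongs to several structural families at once, which is the multi-domain property the abstract highlights.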