Using machine learning to extract drug and gene relationships from text

  • Authors: Russ Altman; Jeffrey T. Chang

  • Affiliations: -; -

  • Venue: Using machine learning to extract drug and gene relationships from text
  • Year: 2004

Abstract

Interpatient variability in responses to drugs leads to millions of hospitalizations every year. To help prevent these failures, the discipline of pharmacogenomics intends to characterize the genomic profiles that may lead to undesirable drug responses. Pharmacogenomic scientists must integrate research findings across the genomic, molecular, cellular, tissue, organ, and organismic levels. To address this challenge, I have developed methods to extract information relevant to pharmacogenomics from the literature. These methods can serve as the foundation for powerful tools that help scientists synthesize information and generate new biological hypotheses. Specifically, this thesis covers novel applications and extensions of supervised machine learning algorithms to extract relationships between genes and drugs automatically. This task comprises several problems that must be solved separately. Thus, I have also developed algorithms to identify and score gene names and their abbreviations in text. I have framed these tasks as classification problems, where the computer must integrate diverse evidence to produce a decision. I identified features that captured information relevant to the problem and then encoded them into representations suitable for classification. To extract a comprehensive list of gene-drug relationships, an algorithm must find gene and protein names in text. Using such an algorithm, the computer could identify newly coined gene names. My approach to this problem achieved 83% recall at 82% precision. Since many of these names were abbreviations, e.g., TPMT for thiopurine methyltransferase, I developed an abbreviation identification algorithm that found these concurrences with 84% recall at 81% precision. The final algorithm classified relationships between genes and drugs into five categories with 74% accuracy. Finally, I have made these algorithms and other results available on the internet at http://bionlp.stanford.edu/. The code is available both as human-accessible web pages and computer-accessible web services.
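
To make the abbreviation task and its evaluation concrete, here is a minimal Python sketch. It is not the method of the thesis, which used supervised machine learning over encoded features. It is a simple hand-written heuristic that looks for the pattern "long form (ABBR)" and accepts a candidate when the abbreviation's letters appear in order within the preceding words. The function names, the window size, and the example sentence are all assumptions made for illustration. A second helper shows how precision and recall, the measures quoted in the abstract, are computed against a gold-standard set.

```python
import re

# Hypothetical pattern: a short alphanumeric token inside parentheses.
ABBR_IN_PARENS = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def find_abbreviation_candidates(text):
    """Return (long_form, abbreviation) pairs found by a simple heuristic.

    Illustrative assumption, not the thesis algorithm: the long form is the
    shortest run of words directly before the parentheses whose first letter
    matches the abbreviation and which contains its letters in order.
    """
    pairs = []
    for match in ABBR_IN_PARENS.finditer(text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        # Limit how far back to search (an arbitrary choice for this sketch).
        max_words = min(len(abbr) + 5, 2 * len(abbr))
        window = words[-max_words:]
        # Try the shortest candidate long form first.
        for start in range(len(window) - 1, -1, -1):
            candidate = window[start:]
            if candidate[0][0].lower() != abbr[0].lower():
                continue
            if is_subsequence(abbr.lower(), "".join(candidate).lower()):
                pairs.append((" ".join(candidate), abbr))
                break
    return pairs


def precision_recall(predicted, gold):
    """Compute precision and recall of predicted items against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = "Variants in thiopurine methyltransferase (TPMT) alter drug response."
    print(find_abbreviation_candidates(sentence))
```

A fixed rule like this one accepts or rejects each candidate on a single criterion. Framing the task as classification, as the abstract describes, instead lets the computer weigh several pieces of evidence together before producing a decision.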