Automatic approaches for gene-drug interaction extraction from biomedical text: corpus and comparative evaluation

Authors:
Nate Sutton;Laura Wojtulewicz;Neel Mehta;Graciela Gonzalez
Affiliations:
Arizona State University, Tempe, Arizona;Arizona State University, Tempe, Arizona;Arizona State University, Tempe, Arizona;Arizona State University, Tempe, Arizona
Venue:
BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Year:
2012

Abstract

Publications that report genotype-drug interaction findings, as well as manually curated databases such as DrugBank and PharmGKB, are essential to advancing pharmacogenomics, a relatively new area merging pharmacology and genomic research. Natural language processing (NLP) methods can automatically extract knowledge such as gene-drug interactions, giving researchers immediate access to published findings and giving curators a shortcut for their work. We present a corpus of gene-drug interactions for training and evaluating systems that extract those interactions. The corpus includes 551 sentences, each mentioning a drug and a gene, drawn from about 600 journals found to be relevant to pharmacogenomics through an analysis of gene-drug relationships in the PharmGKB knowledgebase. We evaluated basic approaches to automatic extraction: gene and drug co-occurrence, co-occurrence plus interaction terms, and a linguistic pattern-based method. The linguistic pattern method had the highest precision (96.61%) but the lowest recall (7.30%), for an F-score of 13.57%. Basic co-occurrence yields 68.99% precision; adding an interaction term increases precision only slightly (to 69.60%), less than might be expected. Co-occurrence is a reasonable baseline method, and the pattern-based method is a promising approach if enough patterns can be generated to address recall. The corpus is available at http://diego.asu.edu/index.php/projects
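To make the two co-occurrence baselines and the reported F-score concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The lexicons `GENES`, `DRUGS`, and `INTERACTION_TERMS` are hypothetical toy lists, and the function names are invented for this example. The abstract does not describe the actual term lists or matching rules.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only);
# the paper's real gene, drug, and interaction-term lists are not given here.
GENES = {"CYP2D6", "CYP2C9", "VKORC1"}
DRUGS = {"warfarin", "codeine", "tamoxifen"}
INTERACTION_TERMS = {"inhibits", "metabolizes", "induces", "affects"}


def tokens(sentence):
    """Return the set of lowercased alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def cooccurrence(sentence):
    """Baseline 1: predict an interaction when a gene and a drug co-occur."""
    toks = tokens(sentence)
    has_gene = any(g.lower() in toks for g in GENES)
    has_drug = any(d.lower() in toks for d in DRUGS)
    return has_gene and has_drug


def cooccurrence_with_term(sentence):
    """Baseline 2: co-occurrence plus at least one interaction term."""
    toks = tokens(sentence)
    has_term = any(t in toks for t in INTERACTION_TERMS)
    return cooccurrence(sentence) and has_term


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    s = "CYP2D6 metabolizes codeine to morphine."
    print(cooccurrence(s), cooccurrence_with_term(s))
    # Precision and recall reported for the linguistic pattern method.
    print(f"{100 * f_score(0.9661, 0.0730):.2f}%")
```

Plugging the reported pattern-method precision (96.61%) and recall (7.30%) into `f_score` reproduces the 13.57% stated in the abstract, which shows how heavily the harmonic mean penalizes low recall.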