Efficient Extraction of Protein-Protein Interactions from Full-Text Articles

Authors:
Jörg Hakenberg; Robert Leaman; Nguyen Ha Vo; Siddhartha Jonnalagadda; Ryan Sullivan; Christopher Miller; Luis Tari; Chitta Baral; Graciela Gonzalez
Affiliations:
Arizona State University, Tempe, AZ; Arizona State University, Phoenix, AZ; Arizona State University, Tempe, AZ; Arizona State University, Phoenix, AZ; Arizona State University, Phoenix, AZ; Arizona State University, Phoenix, AZ; Hoffmann-La Roche Inc., Nutley, NJ; Arizona State University, Tempe, AZ; Arizona State University, Phoenix, AZ
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2010

Abstract

Proteins and their interactions govern virtually all cellular processes, such as regulation, signaling, metabolism, and structure. Most experimental findings on such interactions are reported in research papers, which are in turn curated by protein interaction databases. Authors, editors, and publishers benefit from efforts to ease the tasks of searching for relevant papers, finding evidence for physical interactions, and assigning proper identifiers to each protein involved. The BioCreative II.5 community challenge addressed these tasks in a competition-style assessment to evaluate and compare different methodologies, to raise awareness of the increasing accuracy of automated methods, and to guide future implementations. In this paper, we present our approaches for protein named entity recognition, including normalization, and for extraction of protein-protein interactions from full text. Our overall goal is to identify efficient individual components, and we compare various compositions that process a single full-text article in between 10 seconds and 2 minutes. We propose strategies to transfer document-level annotations to the sentence level, which allows for the creation of a more fine-grained training corpus; we use this corpus to automatically derive around 5,000 patterns. We rank sentences by relevance to the task of finding novel interactions with physical evidence, using a sentence classifier built from this training corpus. Heuristics for paraphrasing sentences help to remove unnecessary information that might interfere with patterns, such as additional adjectives, clauses, or bracketed expressions. In BioCreative II.5, we achieved an f-score of 22 percent for finding protein interactions and 43 percent for mapping proteins to UniProt IDs; disregarding species, the f-scores are 30 percent and 55 percent, respectively. On average, our best-performing setup required around 2 minutes per full text.
All data and pattern sets as well as Java classes that extend third-party software are available as supplementary information (see Appendix).
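
To make two of the steps named in the abstract more concrete, the following Python sketch illustrates one plausible reading of them: transferring document-level interaction annotations to sentences, and stripping bracketed expressions before pattern matching. It is not the authors' implementation (their supplementary code is in Java). The function names, the rule "a sentence is positive if it mentions both partners of a curated pair", and the placeholder identifiers `ID_A`, `ID_B`, `ID_C` are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Assumed heuristic: match an innermost (...) or [...] span plus leading whitespace.
_BRACKETED = re.compile(r"\s*[\(\[][^()\[\]]*[\)\]]")


def strip_bracketed(sentence):
    """Remove bracketed expressions, innermost first, then tidy whitespace."""
    previous = None
    while previous != sentence:  # repeat so nested brackets are also removed
        previous = sentence
        sentence = _BRACKETED.sub("", sentence)
    return re.sub(r"\s+", " ", sentence).strip()


def transfer_to_sentences(doc_pairs, sentences):
    """Label each sentence using document-level interaction pairs.

    doc_pairs: set of frozensets, each holding two protein identifiers
               annotated as interacting at the document level.
    sentences: list of (text, set of protein identifiers recognized in it).

    Assumption: a sentence counts as positive if it mentions both partners
    of at least one document-level pair. Self-interactions are ignored here.
    """
    labeled = []
    for text, proteins in sentences:
        pairs_here = {frozenset(p) for p in combinations(sorted(proteins), 2)}
        hits = pairs_here & doc_pairs
        labeled.append((text, bool(hits), hits))
    return labeled


if __name__ == "__main__":
    doc_pairs = {frozenset({"ID_A", "ID_B"})}
    sentences = [
        ("ProtA (also called PA) binds ProtB [12].", {"ID_A", "ID_B"}),
        ("ProtA is expressed together with ProtC.", {"ID_A", "ID_C"}),
    ]
    for text, is_positive, hits in transfer_to_sentences(doc_pairs, sentences):
        print(is_positive, strip_bracketed(text))
```

Sentences labeled this way could then serve as training data for a sentence classifier or as a source for pattern derivation, as the abstract describes; the actual transfer strategies and paraphrasing heuristics in the paper may differ.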