Assessing the correlation between contextual patterns and biological entity tagging

Authors:
M. Krallinger;M. Padrón;C. Blaschke;A. Valencia
Affiliations:
National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid
Venue:
JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Year:
2004

Abstract

The tagging of biological entities, in particular gene and protein names, is an essential step in the analysis of textual information in molecular biology and biomedicine. The problem is harder than originally thought, both because the research area is highly dynamic, with new genes and their functions constantly being discovered, and because commonly accepted standards are lacking. Over the last four to five years, a wide range of techniques has been used to detect protein and gene names, from standard NLP methods to purely bioinformatics approaches. Here we explore the relationship between protein/gene names and the expressions used to characterize protein/gene function. These expressions are captured in a collection of patterns derived from an initial set of manually compiled expressions, extended to cover lexical variants, and filtered using known cases of pattern/name association. Applying these patterns to a large collection of curated sentences, we found a significant number of patterns with a very strong tendency to appear only in sentences that also contain a protein/gene name. This approach is part of a larger effort to incorporate contextual information in order to make biological information less ambiguous.
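
The measurement described in the abstract, how strongly a contextual pattern tends to occur only in sentences that also contain a protein/gene name, can be illustrated with a minimal sketch. This is not the authors' implementation: the two regular expressions, the four example sentences, their name-presence flags, and the function name `pattern_name_cooccurrence` are all hypothetical and chosen only to show the counting step.

```python
import re
from collections import Counter


def pattern_name_cooccurrence(patterns, sentences):
    """For each pattern, return (number of matching sentences,
    fraction of those sentences that also contain a tagged name)."""
    matched = Counter()
    with_name = Counter()
    for text, has_name in sentences:
        for label, regex in patterns.items():
            if regex.search(text):
                matched[label] += 1
                if has_name:
                    with_name[label] += 1
    # Patterns that never match are omitted to avoid division by zero.
    return {
        label: (matched[label], with_name[label] / matched[label])
        for label in patterns
        if matched[label]
    }


# Hypothetical functional-expression patterns with simple lexical variants.
PATTERNS = {
    "is phosphorylated by": re.compile(
        r"\b(?:is|are|was|were) phosphorylated by\b", re.I
    ),
    "binds to": re.compile(r"\bbinds? to\b", re.I),
}

# Hypothetical curated sentences; the flag marks whether a curator
# tagged a protein/gene name in the sentence.
SENTENCES = [
    ("Rad53 is phosphorylated by Mec1 after DNA damage.", True),
    ("The substrate was phosphorylated by the purified kinase Cdc28.", True),
    ("Cdc20 binds to the anaphase-promoting complex.", True),
    ("The dye binds to the membrane within minutes.", False),
]

scores = pattern_name_cooccurrence(PATTERNS, SENTENCES)
for label, (n, frac) in scores.items():
    print(f"{label}: {n} matches, {frac:.2f} with a protein/gene name")
```

In this toy data the first pattern occurs only alongside a tagged name, while the second also matches a sentence without one; a pattern of the first kind is what the abstract calls strongly associated with protein/gene names.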