Assessing the correlation between contextual patterns and biological entity tagging

  • Authors:
  • M. Krallinger;M. Padrón;C. Blaschke;A. Valencia

  • Affiliations:
  • National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid;National Center of Biotechnology (CNB-CSIC), Cantoblanco, Madrid

  • Venue:
  • JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
  • Year:
  • 2004

Abstract

The tagging of biological entities, and in particular gene and protein names, is an essential step in the analysis of textual information in Molecular Biology and Biomedicine. The problem is harder than was originally thought because of the highly dynamic nature of the research area, in which new genes and their functions are constantly being discovered, and because of the lack of commonly accepted standards. An impressive collection of techniques has been used to detect protein and gene names in the last four to five years, ranging from typical NLP to purely bioinformatics approaches. Here we explore the relationship between protein/gene names and the expressions used to characterize protein/gene function. These expressions are captured in a collection of patterns built from an original set of manually derived expressions, extended to cover lexical variants, and filtered using known cases of pattern/name associations. Applying these patterns to a large collection of curated sentences, we found a significant number of patterns with a very strong tendency to appear only in sentences in which a protein/gene name is also present. This approach is part of a larger effort to incorporate contextual information so as to make biological information less ambiguous.
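The kind of measurement the abstract describes, how often a contextual pattern appears in sentences that also contain a protein/gene name, could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `pattern_name_cooccurrence`, the two regular-expression patterns, and the toy sentences are all hypothetical, and the abstract does not say which statistic or pattern formalism the authors actually used.

```python
import re
from collections import Counter


def pattern_name_cooccurrence(sentences, patterns):
    """Count, per pattern, how many matching sentences also contain a name.

    sentences: iterable of (text, has_name) pairs, where has_name says whether
               a curator marked a protein/gene name in the sentence.
    patterns:  list of regular-expression strings (hypothetical stand-ins for
               the paper's contextual patterns).
    Returns {pattern: (n_matches, n_matches_with_name, fraction_with_name)}.
    """
    compiled = {p: re.compile(p, re.IGNORECASE) for p in patterns}
    matched = Counter()
    matched_with_name = Counter()
    for text, has_name in sentences:
        for p, rx in compiled.items():
            if rx.search(text):
                matched[p] += 1
                if has_name:
                    matched_with_name[p] += 1
    return {
        p: (
            matched[p],
            matched_with_name[p],
            matched_with_name[p] / matched[p] if matched[p] else 0.0,
        )
        for p in patterns
    }


# Invented example data, not taken from the paper's corpus.
sentences = [
    ("Rad51 is phosphorylated by c-Abl.", True),
    ("The kinase is phosphorylated by an unknown factor.", False),
    ("Expression of p53 was induced by UV.", True),
    ("We thank the reviewers.", False),
]
patterns = [r"\bis phosphorylated by\b", r"\bexpression of\b"]
stats = pattern_name_cooccurrence(sentences, patterns)
```

A pattern whose fraction stays close to 1.0 over a large curated corpus would be a candidate contextual cue for the presence of a name; on a toy sample like this one the fractions carry no statistical weight.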