Rule-Based Protein Term Identification with Help from Automatic Species Tagging

Authors:
Xinglong Wang
Affiliations:
School of Informatics, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9LW, Scotland
Venue:
CICLing '07 Proceedings of the 8th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

In biomedical articles, a single term often refers to different protein entities. For example, an arbitrary occurrence of the term p53 might denote any of thousands of proteins across a number of species. A human annotator can resolve this ambiguity relatively easily by looking at the context and, if necessary, searching an appropriate protein database. However, the ambiguity poses a serious problem for a text mining system, which does not understand human language and hence cannot identify the correct protein that the term refers to. In this paper, we present a Term Identification system that automatically assigns unique identifiers, as found in a protein database, to ambiguous protein mentions in text. Unlike other solutions described in the literature, which only handle gene/protein mentions for a specific model organism, our system is able to tackle protein mentions across many species by integrating a machine-learning-based species tagger. We have compared the performance of our automatic system to that of human annotators, with very promising results.
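The general idea of species-aware term identification can be illustrated with a minimal sketch. This is not the system described in the paper: the lexicon entries, the identifiers, the cue words, and the function names (`normalize`, `tag_species`, `identify`) are all hypothetical, and a simple keyword counter stands in for the machine-learning-based species tagger.

```python
from typing import Optional

# Hypothetical toy lexicon: (normalized term, species) -> database identifier.
# The identifiers are made-up placeholders, not real database entries.
LEXICON = {
    ("p53", "human"): "ID_HUMAN_P53",
    ("p53", "mouse"): "ID_MOUSE_P53",
}

# Hypothetical cue words; a stand-in for a trained species tagger.
SPECIES_CUES = {
    "human": ("human", "patient", "homo sapiens"),
    "mouse": ("mouse", "mice", "murine"),
}


def normalize(term: str) -> str:
    """Rule-based normalization: lowercase, drop hyphens and spaces."""
    return term.lower().replace("-", "").replace(" ", "")


def tag_species(context: str) -> Optional[str]:
    """Pick the species whose cue words occur most often in the context."""
    text = context.lower()
    counts = {
        species: sum(text.count(cue) for cue in cues)
        for species, cues in SPECIES_CUES.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def identify(term: str, context: str) -> Optional[str]:
    """Return an identifier for the term, or None if it stays ambiguous."""
    species = tag_species(context)
    if species is None:
        return None
    return LEXICON.get((normalize(term), species))
```

With this toy setup, `identify("p53", "Expression of p53 in murine fibroblasts")` resolves to the mouse entry, while a context with no species cue leaves the mention unresolved rather than guessing.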