Classifying gene sentences in biomedical literature by combining high-precision gene identifiers

  • Authors:
  • Sun Kim; Won Kim; Don Comeau; W. John Wilbur

  • Affiliations:
  • National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD (all authors)

  • Venue:
  • BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
  • Year:
  • 2012

Abstract

Gene name identification is a fundamental step toward solving more complex text mining problems such as gene normalization and protein-protein interactions. However, state-of-the-art name identification methods are not yet accurate enough for use in a fully automated system. A relaxed task, gene/protein sentence identification, may therefore serve more effectively for manually searching and browsing biomedical literature. In this paper, we define a new task, gene/protein sentence classification, and propose an ensemble approach to address it. Well-known named entity tools use similar gold-standard sets for training and testing, which results in relatively poor performance on unseen sets. We explore how to combine diverse high-precision gene identifiers for more robust performance. The experimental results show that the proposed approach outperforms BANNER as a stand-alone classifier on newly annotated sets as well as on previous gold-standard sets.
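
The abstract does not say how the high-precision identifiers are combined, so the following is only a minimal sketch of one plausible reading: each identifier votes on whether a sentence mentions a gene or protein, and the sentence is labeled positive once enough identifiers agree. The lexicon, the regular expression, the function names, and the `min_votes` threshold are all illustrative assumptions and are not taken from the paper.

```python
import re
from typing import Callable, Iterable, List

# An identifier maps a sentence to True when it detects a gene/protein mention.
Identifier = Callable[[str], bool]


def dictionary_identifier(lexicon: Iterable[str]) -> Identifier:
    """Hypothetical identifier: exact token match against a small gene lexicon."""
    lowered = {name.lower() for name in lexicon}

    def identify(sentence: str) -> bool:
        tokens = re.findall(r"[A-Za-z0-9-]+", sentence.lower())
        return any(token in lowered for token in tokens)

    return identify


def pattern_identifier(pattern: str) -> Identifier:
    """Hypothetical identifier: a regular expression tuned for precision."""
    compiled = re.compile(pattern)

    def identify(sentence: str) -> bool:
        return compiled.search(sentence) is not None

    return identify


def classify_gene_sentence(
    sentence: str, identifiers: List[Identifier], min_votes: int = 1
) -> bool:
    """Label a sentence positive when at least `min_votes` identifiers fire.

    With min_votes=1 this is a plain union of the identifiers (an assumed
    combination rule, not the one reported in the paper).
    """
    votes = sum(1 for identify in identifiers if identify(sentence))
    return votes >= min_votes


# Toy identifiers; the lexicon and the pattern are made up for illustration.
identifiers = [
    dictionary_identifier({"BRCA1", "TP53"}),
    pattern_identifier(r"\b[A-Z][A-Z0-9]{2,}\s+(?:gene|protein)\b"),
]
```

Calling `classify_gene_sentence("Mutations in BRCA1 increase cancer risk.", identifiers)` fires only the dictionary identifier, which is enough under the default threshold. Raising `min_votes` trades recall for precision, since more identifiers must then agree before a sentence is accepted.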