Classifying gene sentences in biomedical literature by combining high-precision gene identifiers

Authors:
Sun Kim; Won Kim; Don Comeau; W. John Wilbur
Affiliations:
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD (all authors)
Venue:
BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Year:
2012

Abstract

Gene name identification is a fundamental step toward solving more complex text mining problems such as gene normalization and protein-protein interactions. However, state-of-the-art name identification methods are not yet accurate enough for use in a fully automated system. A relaxed task, gene/protein sentence identification, may therefore serve more effectively for manually searching and browsing biomedical literature. In this paper, we set up a new task, gene/protein sentence classification, and propose an ensemble approach to address it. Well-known named entity tools use similar gold-standard sets for training and testing, which results in relatively poor performance on previously unseen sets. Here we explore how to combine diverse high-precision gene identifiers for more robust performance. The experimental results show that the proposed approach outperforms BANNER as a stand-alone classifier on newly annotated sets as well as on previous gold-standard sets.
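
The abstract does not say how the identifiers are combined, so the following is only a minimal sketch of one plausible reading: a sentence is labeled a gene/protein sentence when at least one high-precision identifier fires. Everything in it is an assumption made for illustration, including the toy lexicon, the contextual-cue pattern, the function names, and the any-vote combination rule. It is not the authors' ensemble.

```python
import re
from typing import Callable, Iterable

# An identifier maps a sentence to True when it is confident a gene is mentioned.
Identifier = Callable[[str], bool]

# Hypothetical toy lexicon; a real system would load a curated gene dictionary.
TOY_GENE_LEXICON = {"brca1", "tp53", "egfr"}

# Hypothetical contextual cue: a capitalized symbol containing a digit,
# immediately followed by the word "gene" or "protein".
CUE_PATTERN = re.compile(r"\b[A-Z][A-Za-z]*\d+[A-Za-z0-9]*\s+(?:gene|protein)\b")


def lexicon_identifier(sentence: str) -> bool:
    """Fire when any token exactly matches an entry in the toy lexicon."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence.lower())
    return any(token in TOY_GENE_LEXICON for token in tokens)


def cue_identifier(sentence: str) -> bool:
    """Fire when a symbol-like token is followed by 'gene' or 'protein'."""
    return CUE_PATTERN.search(sentence) is not None


def is_gene_sentence(sentence: str, identifiers: Iterable[Identifier]) -> bool:
    """Assumed combination rule: positive if any identifier fires."""
    return any(identifier(sentence) for identifier in identifiers)


IDENTIFIERS = [lexicon_identifier, cue_identifier]

if __name__ == "__main__":
    for text in [
        "BRCA1 mutations increase cancer risk.",
        "Mutations in the ABC7 gene were observed.",
        "Patients were followed for two years.",
    ]:
        print(is_gene_sentence(text, IDENTIFIERS), text)
```

The intuition behind an any-vote rule is that when each identifier rarely produces false positives, taking their union can raise recall while keeping precision acceptable. Whether that trade-off holds depends entirely on the identifiers actually used.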