Improving the performance of dictionary-based approaches in protein name recognition

Authors:
Yoshimasa Tsuruoka;Jun'ichi Tsujii
Affiliations:
CREST, Japan Science and Technology (JST) Agency, Honcho 4-1-8, Kawaguchi-shi, Saitama 332-0012, Japan;Department of Computer Science, University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-0033, Japan
Venue:
Journal of Biomedical Informatics - Special issue: Named entity recognition in biomedicine
Year:
2004



Abstract

Dictionary-based protein name recognition is often the first step in extracting information from biomedical documents because it can provide ID information for recognized terms. However, dictionary-based approaches present two fundamental difficulties: (1) false recognition, mainly caused by short names; and (2) low recall due to spelling variations. In this paper, we tackle the former problem by using machine learning to filter out false positives, and we present two alternative methods for alleviating the latter problem of spelling variations. The first uses approximate string searching, and the second expands the dictionary with a probabilistic variant generator, which we propose in this paper. Experimental results using the GENIA corpus revealed that filtering with a naive Bayes classifier greatly improved precision with only a slight loss of recall, resulting in a 10.8% improvement in F-measure, and that dictionary expansion with the variant generator gave a further 1.6% improvement and achieved an F-measure of 66.6%.
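
The approximate string searching mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the edit costs or the matching threshold, so the unit-cost Levenshtein distance, the `max_ratio` parameter, the `lookup` helper, and the toy dictionary with its placeholder IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumed cost scheme)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lookup(candidate: str,
           dictionary: Dict[str, str],
           max_ratio: float = 0.2) -> List[Tuple[str, str, int]]:
    """Return (name, id, distance) for dictionary names close to candidate.

    An entry matches when its case-insensitive edit distance to the
    candidate is at most max_ratio * len(name). The threshold is a
    hypothetical choice, not a value taken from the paper.
    """
    text = candidate.lower()
    hits = []
    for name, protein_id in dictionary.items():
        d = edit_distance(text, name.lower())
        if d <= max_ratio * len(name):
            hits.append((name, protein_id, d))
    return sorted(hits, key=lambda h: h[2])


# Toy dictionary with placeholder IDs (hypothetical data).
toy_dictionary = {"NF-kappa B": "ID_0001", "interleukin-2": "ID_0002"}

print(lookup("NF kappa B", toy_dictionary))     # spelling variant: space vs hyphen
print(lookup("interleukin 2", toy_dictionary))  # spelling variant: space vs hyphen
print(lookup("insulin", toy_dictionary))        # unrelated name, no match expected
```

Scaling the threshold by name length is one way to keep short names from matching too freely, which is related to the false-positive problem the abstract attributes to short names. A real system would also need an index instead of a linear scan over the dictionary.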