How to make the most of NE dictionaries in statistical NER

Authors:
Yutaka Sasaki;Yoshimasa Tsuruoka;John McNaught;Sophia Ananiadou
Affiliations:
University of Manchester, Manchester, UK;University of Manchester, Manchester, UK;National Centre for Text Mining and University of Manchester, Manchester, UK;National Centre for Text Mining and University of Manchester, Manchester, UK
Venue:
BioNLP '08 Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing
Year:
2008

Citing 19
Cited 3

An Efficient Digital Search Algorithm by Using a Double-Array Structure

IEEE Transactions on Software Engineering
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Maximum Entropy Markov Models for Information Extraction and Segmentation

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Representing text chunks

EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Extracting the names of genes and gene products with a hidden Markov model

COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
GAPSCORE: finding gene and protein names one word at a time

Bioinformatics
Tuning support vector machines for biomedical named entity recognition

BioMed '02 Proceedings of the ACL-02 workshop on Natural language processing in the biomedical domain - Volume 3
Two-phase biomedical NE recognition based on SVMs

BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Protein name tagging for biomedical annotation in text

BioMed '03 Proceedings of the ACL 2003 workshop on Natural language processing in biomedicine - Volume 13
Improving the scalability of semi-Markov conditional random fields for named entity recognition

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Experimental Study on a Two Phase Method for Biomedical Named Entity Recognition

IEICE - Transactions on Information and Systems
Introduction to the bio-entity recognition task at JNLPBA

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Incorporating lexical knowledge into biomedical NE recognition

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Exploiting context for biomedical entity recognition: from syntax to the web

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Adapting an NER-system for German to the biomedical domain

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Exploring deep knowledge resources in biomedical name recognition

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
POSBIOTM-NER in the shared task of BioNLP/NLPBA 2004

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Biomedical named entity recognition using conditional random fields and rich feature sets

JNLPBA '04 Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
Integrating linguistic knowledge into a conditional random fieldframework to identify biomedical named entities

Expert Systems with Applications: An International Journal

SimSem: fast approximate string matching in relation to semantic category disambiguation

BioNLP '11 Proceedings of BioNLP 2011 Workshop
Adding text mining workflows as web services to the BioCatalogue

Proceedings of the 4th International Workshop on Semantic Web Applications and Tools for the Life Sciences
Smart Health and Wellbeing

ACM Transactions on Management Information Systems (TMIS) - Special Issue on Informatics for Smart Health and Wellbeing

Quantified Score

Hi-index	0.00

Visualization

Abstract

When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many researches on statistical NER have tried to cope with these problems. However, it is not straightforward how to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, addition of NEs to an NE dictionary leads to better performance. However, in reality, the retraining of NER models is required to achieve this. We have established a novel way to improve the NER performance by addition of NEs to an NE dictionary without retraining. We chose protein name recognition as a case study because it most suffers the problems related to heavy term variation and ambiguity. In our approach, first, known NEs are identified in parallel with Part-of-Speech (POS) tagging based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the tagger outputs with correct NE labels attached. We evaluated performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set has been improved from 73.14 to 73.78 after adding the protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names. Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.