ICCCI'12 Proceedings of the 4th international conference on Computational Collective Intelligence: technologies and applications - Volume Part I
In this paper we present a bootstrapping approach for training a Named Entity Recognition (NER) system. Our method starts by annotating person names in a dataset of 50,000 news items using a simple dictionary-based approach. From this training set we build a classification model based on Conditional Random Fields (CRF). We then use the inferred model to produce additional annotations of the initial seed corpus, which in turn serve to train a new classification model. This cycle is repeated until the NER model stabilizes. We evaluate each bootstrapping iteration by calculating: (i) the precision and recall of the NER model on a small gold-standard collection (HAREM); (ii) the precision and recall of the CRF bootstrapping annotation method over a small sample of news; and (iii) the correctness and the number of new names identified. Additionally, we compare the NER model against our baseline, a dictionary-based approach. Results show that our bootstrapping approach stabilizes after 7 iterations, achieving 83% precision and 68% recall.
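The bootstrap cycle described above (dictionary seed, train, re-annotate, repeat until no new names appear) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the corpus, the seed dictionary, and the context rule standing in for the CRF model are all hypothetical.

```python
# Hedged sketch of the bootstrapping loop from the abstract.
# A toy context rule (capitalized token adjacent to a known person
# name) stands in for CRF training; all data here is illustrative.

SEED_NAMES = {"Maria"}                       # hypothetical seed dictionary
CORPUS = [
    "Maria Santos arrived yesterday",
    "Santos Costa spoke today",
]

def annotate(sentence, names):
    """Dictionary-based annotation: tag tokens found in `names` as PER."""
    return [(tok, "PER" if tok in names else "O") for tok in sentence.split()]

def train_and_tag(labeled):
    """Stand-in for CRF training + tagging: propose as new person names
    any capitalized O-token adjacent to a PER-tagged token."""
    new_names = set()
    for sent in labeled:
        for i, (tok, tag) in enumerate(sent):
            if tag != "PER":
                continue
            for j in (i - 1, i + 1):
                if 0 <= j < len(sent):
                    nb_tok, nb_tag = sent[j]
                    if nb_tag == "O" and nb_tok[0].isupper():
                        new_names.add(nb_tok)
    return new_names

def bootstrap(corpus, seed, max_iter=10):
    """Repeat annotate -> train -> extract until no new names are found."""
    names = set(seed)
    for iteration in range(1, max_iter + 1):
        labeled = [annotate(s, names) for s in corpus]
        new = train_and_tag(labeled) - names
        if not new:                 # model stabilized: nothing new learned
            return names, iteration
        names |= new                # enlarge the annotations and retrain
    return names, max_iter
```

Each pass enlarges the seed annotations with the names the model discovers, mirroring the paper's cycle; the real system replaces the toy rule with a CRF retrained on the re-annotated corpus.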