A hybrid model for annotating named entity training corpora

Authors:
Robert Voyer;Valerie Nygaard;Will Fitzgerald;Hannah Copperman
Affiliations:
Microsoft, San Francisco, CA;Microsoft, San Francisco, CA;Microsoft, San Francisco, CA;Microsoft, San Francisco, CA
Venue:
LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop
Year:
2010


Abstract

In this paper, we present a two-phase, hybrid model for generating training data for Named Entity Recognition systems. In the first phase, a trained annotator labels all named entities in a text irrespective of type. In the second phase, naïve crowdsourcing workers complete binary judgment tasks to indicate the type(s) of each entity. Decomposing the data generation task in this way results in a flexible, reusable corpus that accommodates changes to entity type taxonomies. In addition, it makes efficient use of scarce trained-annotator time by leveraging highly available, cost-effective crowdsourcing worker pools without sacrificing quality.
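
The two-phase decomposition can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `EntitySpan` class, the `assign_types` function, the `min_agreement` threshold, the aggregation rule (share of "yes" answers), and the example sentence and type labels are all assumptions made for illustration. The abstract does not say how worker judgments are combined.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    """Phase 1 output: an untyped entity mention marked by a trained annotator."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


def assign_types(judgments, min_agreement=0.5):
    """Aggregate phase 2 binary judgments into a set of types per span.

    judgments: iterable of (span, entity_type, is_type) tuples, one per
    worker answer to the question "is this span of this type?".
    A type is kept when the share of "yes" answers exceeds min_agreement
    (a hypothetical threshold, not taken from the paper).
    """
    votes = defaultdict(list)
    for span, entity_type, is_type in judgments:
        votes[(span, entity_type)].append(is_type)

    typed = defaultdict(set)
    for (span, entity_type), answers in votes.items():
        if sum(answers) / len(answers) > min_agreement:
            typed[span].add(entity_type)
    return dict(typed)


# Phase 1: spans are marked without any type information.
text = "Paris Hilton visited Paris."
hilton = EntitySpan(0, 12, "Paris Hilton")
paris = EntitySpan(21, 26, "Paris")

# Phase 2: three (invented) workers answer one yes/no question per span and type.
judgments = [
    (hilton, "PERSON", True), (hilton, "PERSON", True), (hilton, "PERSON", True),
    (hilton, "LOCATION", False), (hilton, "LOCATION", True), (hilton, "LOCATION", False),
    (paris, "LOCATION", True), (paris, "LOCATION", True), (paris, "LOCATION", False),
    (paris, "PERSON", False), (paris, "PERSON", False), (paris, "PERSON", False),
]
result = assign_types(judgments)
```

Because the spans carry no type, a revised taxonomy only requires new binary judgment tasks over the existing spans; the phase 1 annotation is reused unchanged, which is the reusability the abstract describes.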