A hybrid model for annotating named entity training corpora

  • Authors:
  • Robert Voyer;Valerie Nygaard;Will Fitzgerald;Hannah Copperman

  • Affiliations:
  • Microsoft, San Francisco, CA;Microsoft, San Francisco, CA;Microsoft, San Francisco, CA;Microsoft, San Francisco, CA

  • Venue:
  • LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop
  • Year:
  • 2010

Abstract

In this paper, we present a two-phase, hybrid model for generating training data for Named Entity Recognition systems. In the first phase, a trained annotator labels all named entities in a text irrespective of type. In the second phase, naïve crowdsourcing workers complete binary judgment tasks to indicate the type(s) of each entity. Decomposing the data generation task in this way results in a flexible, reusable corpus that accommodates changes to entity type taxonomies. In addition, it makes efficient use of scarce trained annotator resources by leveraging highly available and cost-effective crowdsourcing worker pools without sacrificing quality.
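
As an illustration of the two-phase decomposition described in the abstract, the following minimal Python sketch separates untyped span annotation (phase one) from binary type judgments (phase two). The abstract does not say how worker judgments are combined, so the majority-vote rule, the `min_agreement` threshold, the `ORG`/`LOC` labels, and all names here (`Span`, `binary_tasks`, `aggregate`) are assumptions for illustration only, not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """An untyped entity mention marked by the trained annotator (phase one)."""
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def binary_tasks(spans, taxonomy):
    """Phase-two input: one yes/no question per (span, entity type) pair."""
    return [(span, entity_type) for span in spans for entity_type in taxonomy]


def aggregate(judgments, min_agreement=0.5):
    """Assign a type to a span when the share of 'yes' votes exceeds min_agreement.

    judgments: iterable of (span, entity_type, answer) with answer a bool.
    Majority voting is an assumed aggregation rule, not one taken from the paper.
    """
    votes = defaultdict(list)
    for span, entity_type, answer in judgments:
        votes[(span, entity_type)].append(answer)

    typed = defaultdict(set)
    for (span, entity_type), answers in votes.items():
        if sum(answers) / len(answers) > min_agreement:
            typed[span].add(entity_type)
    return dict(typed)


# Phase one: spans only, no types.
text = "Microsoft opened an office in San Francisco."
spans = [Span(0, 9), Span(30, 43)]

# Phase two: hypothetical taxonomy and made-up worker answers.
tasks = binary_tasks(spans, ["ORG", "LOC"])
judgments = [
    (spans[0], "ORG", True), (spans[0], "ORG", True), (spans[0], "ORG", False),
    (spans[0], "LOC", False), (spans[0], "LOC", False), (spans[0], "LOC", False),
    (spans[1], "LOC", True), (spans[1], "LOC", True), (spans[1], "LOC", True),
    (spans[1], "ORG", False), (spans[1], "ORG", True), (spans[1], "ORG", False),
]
result = aggregate(judgments)
```

Because the spans are stored without types, a change to the taxonomy in this sketch only requires generating new binary tasks from the existing spans; the phase-one annotation is reused unchanged.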