Analysing Wikipedia and gold-standard corpora for NER training

  • Authors:
  • Joel Nothman;Tara Murphy;James R. Curran

  • Affiliations:
  • University of Sydney, Australia;University of Sydney, Australia;University of Sydney, Australia

  • Venue:
  • EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
  • Year:
  • 2009

Abstract

Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named-entity-annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making the corpora more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold standard corpora on cross-corpus evaluation by up to 11%.