Analysing Wikipedia and gold-standard corpora for NER training

Authors:
Joel Nothman; Tara Murphy; James R. Curran
Affiliations:
University of Sydney, Australia; University of Sydney, Australia; University of Sydney, Australia
Venue:
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
Year:
2009

Citing 9

Named Entity recognition without gazetteers
EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics

Detecting errors in part-of-speech annotation
EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Ranking algorithms for named-entity extraction: boosting and the voted perceptron
ACL '02 Proceedings of the 40th Annual Meeting on Association for Computational Linguistics

Automatic acquisition of named entity tagged corpus from world wide web
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 2

Unsupervised named-entity extraction from the web: an experimental study
Artificial Intelligence

Introduction to the CoNLL-2002 shared task: language-independent named entity recognition
COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4

Language independent NER using a maximum entropy tagger
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4

Unsupervised named-entity recognition: generating gazetteers and resolving ambiguity
AI'06 Proceedings of the 19th international conference on Advances in Artificial Intelligence: Canadian Society for Computational Studies of Intelligence

Cited 13

Named entity recognition in Wikipedia
People's Web '09 Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources

A hybrid model for annotating named entity training corpora
LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop

Self-annotation for fine-grained geospatial relation extraction
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics

Learning from partially annotated sequences
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I

A resource-based method for named entity extraction and classification
EPIA'11 Proceedings of the 15th Portuguese conference on Progress in artificial intelligence

Named entity disambiguation based on explicit semantics
SOFSEM'12 Proceedings of the 38th international conference on Current Trends in Theory and Practice of Computer Science

Recall-oriented learning of named entities in Arabic Wikipedia
EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Automatically generated NE tagged corpora for English and Hungarian
NEWS '12 Proceedings of the 4th Named Entity Workshop

Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon
Language Resources and Evaluation

Collaboratively built semi-structured content and Artificial Intelligence: The story so far
Artificial Intelligence

Evaluating Entity Linking with Wikipedia
Artificial Intelligence

Learning multilingual named entity recognition from Wikipedia
Artificial Intelligence

A Named Entity Recognition Method Based on Decomposition and Concatenation of Word Chunks
ACM Transactions on Asian Language Information Processing (TALIP)

Abstract

Named entity recognition (NER) for English typically involves one of three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named-entity-annotated text. We present the first comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making the corpora more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold-standard corpora on cross-corpus evaluation by up to 11%.