Learning multilingual named entity recognition from Wikipedia

Authors:
Joel Nothman;Nicky Ringland;Will Radford;Tara Murphy;James R. Curran
Affiliations:
School of Information Technologies, University of Sydney, NSW 2006, Australia and Capital Markets CRC, 55 Harrington Street, NSW 2000, Australia;School of Information Technologies, University of Sydney, NSW 2006, Australia;School of Information Technologies, University of Sydney, NSW 2006, Australia and Capital Markets CRC, 55 Harrington Street, NSW 2000, Australia;School of Information Technologies, University of Sydney, NSW 2006, Australia;School of Information Technologies, University of Sydney, NSW 2006, Australia and Capital Markets CRC, 55 Harrington Street, NSW 2000, Australia
Venue:
Artificial Intelligence
Year:
2013

Citing 47
Cited 2

An Algorithm that Learns What‘s in a Name

Machine Learning - Special issue on natural language learning
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Named Entity recognition without gazetteers

EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Detecting errors in part-of-speech annotation

EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1
Inducing multilingual text analysis tools via robust projection across aligned corpora

HLT '01 Proceedings of the first international conference on Human language technology research
Automatic acquisition of named entity tagged corpus from world wide web

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 2
NLTK: the Natural Language Toolkit

ETMTNLP '02 Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1
Introduction to the CoNLL-2002 shared task: language-independent named entity recognition

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
The multilingual entity task (MET) overview

TIPSTER '96 Proceedings of a workshop on held at Vienna, Virginia: May 6-8, 1996
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Language independent NER using a maximum entropy tagger

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
A high-performance semi-supervised learning method for text chunking

ACL '05 Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics
Yago: a core of semantic knowledge

Proceedings of the 16th international conference on World Wide Web
Unsupervised Multilingual Sentence Boundary Detection

Computational Linguistics
Identifying Document Topics Using the Wikipedia Category Network

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Autonomously semantifying wikipedia

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Wikify!: linking documents to encyclopedic knowledge

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Measuring article quality in wikipedia: models and evaluation

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
YAGO: A Large Ontology from Wikipedia and WordNet

Web Semantics: Science, Services and Agents on the World Wide Web
LIBLINEAR: A Library for Large Linear Classification

The Journal of Machine Learning Research
Learning to Tag and Tagging to Learn: A Case Study on Wikipedia

IEEE Intelligent Systems
Information arbitrage across multi-lingual Wikipedia

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Cross-lingual alignment and completion of Wikipedia templates

CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
Design challenges and misconceptions in named entity recognition

CoNLL '09 Proceedings of the Thirteenth Conference on Computational Natural Language Learning
Overcoming the brittleness bottleneck using wikipedia: enhancing text categorization with encyclopedic knowledge

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
WikiRelate! computing semantic relatedness using wikipedia

AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
Comparison between tagged corpora for the named entity task

CompareCorpora '00 Proceedings of the Workshop on Comparing Corpora
Analysing Wikipedia and gold-standard corpora for NER training

EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics
A simple semi-supervised algorithm for named entity recognition

SemiSupLearn '09 Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing
One class per named entity: exploiting unlabeled text for named entity recognition

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Unsupervised named-entity extraction from the Web: An experimental study

Artificial Intelligence
Large-scale taxonomy mapping for restructuring and integrating wikipedia

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
Domain adaptive bootstrapping for named entity recognition

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3
Named entity recognition in Wikipedia

People's Web '09 Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources
A Wikipedia-based multilingual retrieval model

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
BabelNet: building a very large multilingual semantic network

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Annotating large email datasets for named entity recognition with Mechanical Turk

CSLDAMT '10 Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk
A hybrid model for annotating named entity training corpora

LAW IV '10 Proceedings of the Fourth Linguistic Annotation Workshop
MENTA: inducing multilingual taxonomies from wikipedia

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Classifying Wikipedia entities into fine-grained classes

ICDEW '11 Proceedings of the 2011 IEEE 27th International Conference on Data Engineering Workshops
Learning from partially annotated sequences

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part I
Training a named entity recognizer on the web

WISE'11 Proceedings of the 12th international conference on Web information system engineering
Automatic assignment of wikipedia encyclopedic entries to wordnet synsets

AWIC'05 Proceedings of the Third international conference on Advances in Web Intelligence
Active learning with Amazon Mechanical Turk

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised named-entity recognition: generating gazetteers and resolving ambiguity

AI'06 Proceedings of the 19th international conference on Advances in Artificial Intelligence: Canadian Society for Computational Studies of Intelligence
Applying wikipedia's multilingual knowledge to cross-lingual question answering

NLDB'07 Proceedings of the 12th international conference on Applications of Natural Language to Information Systems
Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon

Language Resources and Evaluation

Collective information extraction using first-order probabilistic models

Proceedings of the Fifth Balkan Conference in Informatics
Collaboratively built semi-structured content and Artificial Intelligence: The story so far

Artificial Intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (ner) by exploiting the text and structure of Wikipedia. Most ner systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (ne) types, training and evaluating on 7200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into ne annotations by projecting the target article@?s classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against conll shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic ne annotation (Richman and Schone, 2008 [61], Mika et al., 2008 [46]) competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.