Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon

Authors:
Antonio Toral;Sergio Ferrández;Monica Monachini;Rafael Muñoz
Affiliations:
NCLT, School of Computing, Dublin City University, Dublin, Ireland;Natural Language Processing and Information Systems Group, Department of Computing Languages and Systems, University of Alicante, Alicante, Spain 03080;Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche, Pisa, Italy;Natural Language Processing and Information Systems Group, Department of Computing Languages and Systems, University of Alicante, Alicante, Spain 03080
Venue:
Language Resources and Evaluation
Year:
2012

Abstract

This paper aims to advance the state of the art in automatic Language Resource (LR) building by taking three elements into consideration: (1) the knowledge available in existing LRs, (2) the vast amount of information made available by the collaborative paradigm that has emerged from Web 2.0, and (3) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish, and Parole-Simple-Clips for Italian) is extended with Named Entities (NEs) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, an important problem currently affecting Computational Linguistics, by using the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian, respectively. Finally, to check the usefulness of the constructed resource, we integrate it into a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system's accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, number of NEs acquired and richness of the information represented.
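
To make the idea of encoding NE entries in LMF more concrete, the following is a minimal sketch of how one entry might be serialized as LMF-style XML. It is not the authors' encoding: the abstract does not give the schema used, so the helper names (`feat`, `build_entry`, `build_resource`), the `instanceOf` feature, the sense identifier and the sample entry are all assumptions made for illustration. Only the general LMF pattern of `LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `Sense` and `feat` attribute/value pairs is taken from the standard.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an LMF-style <feat att="..." val="..."/> child to parent."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(lexicon, written_form, sense_id, instance_of):
    """Add one named-entity entry to a lexicon element (hypothetical layout)."""
    entry = ET.SubElement(lexicon, "LexicalEntry")
    feat(entry, "partOfSpeech", "properNoun")
    lemma = ET.SubElement(entry, "Lemma")
    feat(lemma, "writtenForm", written_form)
    sense = ET.SubElement(entry, "Sense", {"id": sense_id})
    # "instanceOf" is an assumed feature name linking the NE to a class
    # in an existing LR; the real resource may encode this differently.
    feat(sense, "instanceOf", instance_of)
    return entry


def build_resource(language, entries):
    """Wrap entries for one language in a LexicalResource/Lexicon pair."""
    resource = ET.Element("LexicalResource")
    lexicon = ET.SubElement(resource, "Lexicon")
    feat(lexicon, "language", language)
    for written_form, sense_id, instance_of in entries:
        build_entry(lexicon, written_form, sense_id, instance_of)
    return resource


# Illustrative entry only; the identifier "ne_dublin_1" is made up.
resource = build_resource("en", [("Dublin", "ne_dublin_1", "city")])
xml_text = ET.tostring(resource, encoding="unicode")
print(xml_text)
```

Keeping every property as a `feat` attribute/value pair, rather than as a custom XML attribute, is what lets the same structure carry entries for English, Spanish and Italian without changing the schema.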