Exploiting Wikipedia for cross-lingual and multilingual information retrieval

Authors:
P. Sorg;P. Cimiano
Affiliations:
Institut AIFB, KIT (Campus Süd), 76128 Karlsruhe, Germany;Semantic Computing Group, Cognitive Interaction Technology Excellence Center (CITEC), University of Bielefeld, 33615 Bielefed, Germany
Venue:
Data & Knowledge Engineering
Year:
2012

Citing 17
Cited 2

Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Model-based feedback in the language modeling approach to information retrieval

Proceedings of the tenth international conference on Information and knowledge management
A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval

AMTA '98 Proceedings of the Third Conference of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup
Latent dirichlet allocation

The Journal of Machine Learning Research
Measures of distributional similarity

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Dictionary-based techniques for cross-language information retrieval

Information Processing and Management: an International Journal - Special issue: Cross-language information retrieval
Text categorization with knowledge transfer from heterogeneous data sources

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Concept-based feature generation and selection for information retrieval

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Computing semantic relatedness using Wikipedia-based explicit semantic analysis

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Feature generation for text categorization using world knowledge

IJCAI'05 Proceedings of the 19th international joint conference on Artificial intelligence
Explicit versus latent concept models for cross-language information retrieval

IJCAI'09 Proceedings of the 21st international jont conference on Artifical intelligence
A Wikipedia-based multilingual retrieval model

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Cross lingual text classification by mining multilingual topics from wikipedia

Proceedings of the fourth ACM international conference on Web search and data mining
Cross-language plagiarism detection

Language Resources and Evaluation
Concept-Based Information Retrieval Using Explicit Semantic Analysis

ACM Transactions on Information Systems (TOIS)
Insights into explicit semantic analysis

Proceedings of the 20th ACM international conference on Information and knowledge management
An experimental comparison of explicit semantic analysis implementations for cross-language retrieval

NLDB'09 Proceedings of the 14th international conference on Applications of Natural Language to Information Systems

Cross-lingual query expansion in multilingual folksonomies: A case study on Flickr

Knowledge-Based Systems
Editorial: Minimally-supervised learning of domain-specific causal relations using an open-domain corpus as knowledge base

Data & Knowledge Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this article we show how Wikipedia as a multilingual knowledge resource can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA) which indexes documents with respect to explicit interlingual concepts. These concepts are considered as interlingual and universal and in our case correspond either to Wikipedia articles or categories. Each concept is associated to a text signature in each language which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept which is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-Of-Words model to a Bag-Of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR settings, CL-ESA benefits from a certain level of abstraction in the sense that using categories instead of articles as in the original ESA model delivers better results.