Explicit versus latent concept models for cross-language information retrieval

Authors:
Philipp Cimiano;Antje Schultz;Sergej Sizov;Philipp Sorg;Steffen Staab
Affiliations:
WIS, TU Delft;ISWeb, Univ. Koblenz-Landau;ISWeb, Univ. Koblenz-Landau;AIFB, Univ. Karlsruhe;ISWeb, Univ. Koblenz-Landau
Venue:
IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Year:
2009

Abstract

The field of information retrieval and text manipulation (classification, clustering) still strives for models that allow semantic information to be folded in to improve performance over standard bag-of-words models. Many approaches aim at concept-based retrieval but differ in the nature of the concepts, which range from linguistic concepts as defined in lexical resources such as WordNet, through latent topics derived from the data itself, as in Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), to Wikipedia articles as proxies for concepts, as in the recently proposed Explicit Semantic Analysis (ESA) model. A crucial question that has not been answered so far is whether models based on explicitly given concepts (as in the ESA model, for instance) perform inherently better than retrieval models based on "latent" concepts (as in LSI and/or LDA). In this paper we investigate this question more closely in a cross-language setting, which inherently requires concept-based retrieval to bridge between different languages. In particular, we compare the recently proposed ESA model with two latent models (LSI and LDA), showing that the former is clearly superior to both. From a general perspective, our results contribute to clarifying the role of explicit versus implicitly derived, or latent, concepts in (cross-language) information retrieval research.
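
To make the idea of explicit concepts bridging languages more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny, made-up set of three "Wikipedia" concepts, each with one aligned article per language, and it replaces the tf-idf weighting typically used in ESA with plain term-frequency overlap. The names `CONCEPTS`, `concept_vector`, `cosine`, and `rank` are hypothetical and introduced only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data: each concept has one aligned article per language.
# A real system would use full Wikipedia articles linked across languages.
CONCEPTS = {
    "Dog": {"en": "dog animal pet bark", "de": "hund tier haustier bellen"},
    "Car": {"en": "car vehicle engine road", "de": "auto fahrzeug motor strasse"},
    "Music": {"en": "music song instrument sound", "de": "musik lied instrument klang"},
}


def concept_vector(text, lang):
    """Map a text to one weight per concept (concepts in sorted name order).

    Assumption: the weight is the term-frequency overlap between the text and
    the concept's article in the same language, a simplification of tf-idf.
    """
    words = Counter(text.lower().split())
    vector = []
    for name in sorted(CONCEPTS):
        article = Counter(CONCEPTS[name][lang].split())
        vector.append(sum(words[term] * article[term] for term in words))
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, query_lang, docs, doc_lang):
    """Rank documents in one language against a query in another.

    Both sides are compared in the shared, language-independent concept space.
    """
    query_vec = concept_vector(query, query_lang)
    scored = [(cosine(query_vec, concept_vector(doc, doc_lang)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    german_docs = ["auto motor", "hund bellen", "musik klang"]
    for score, doc in rank("dog pet", "en", german_docs, "de"):
        print(f"{score:.2f}  {doc}")
```

The query and the documents share no words, yet they become comparable because each is projected onto the same set of concepts through articles in its own language. A latent model such as LSI or LDA would instead have to learn its concept dimensions from the data.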