Topic models have been studied extensively in the context of monolingual corpora. Although there have been some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA, which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling cross-lingual corpora offers several advantages over individual monolingual models. Because the JointLDA model merges related topics in different languages into a single multilingual topic: (a) it can fit the data with relatively fewer topics; (b) it can predict related words in a language different from that of the given document. In fact, it has better predictive power than the bag-of-words translation model, suggesting that JointLDA may be preferred over the bag-of-words model for cross-lingual IR applications. We also found that the monolingual models learned while optimizing the cross-lingual corpora are more effective than the corresponding LDA models.
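To illustrate the core idea of a multilingual topic, here is a minimal toy sketch: a topic is treated as a distribution over bilingual dictionary entries, so a single topic can emit words in either language. This is an illustrative assumption for exposition only, not the paper's exact generative process or inference procedure; the dictionary entries and topic weights are hypothetical.

```python
import random

# Hypothetical bilingual dictionary entries (English, Spanish).
DICTIONARY = [
    ("dog", "perro"), ("cat", "gato"),
    ("bank", "banco"), ("money", "dinero"),
]

def sample_document(topic_weights, topics, lang, length, rng):
    """Generate `length` words in language `lang` (0 = en, 1 = es)
    from a mixture of multilingual topics. Each topic in `topics`
    is a list of weights over DICTIONARY entries."""
    words = []
    for _ in range(length):
        # Sample a topic, then a dictionary entry from that topic,
        # then emit the language-specific surface form of the entry.
        k = rng.choices(range(len(topics)), weights=topic_weights)[0]
        entry = rng.choices(DICTIONARY, weights=topics[k])[0]
        words.append(entry[lang])
    return words

rng = random.Random(0)
# Topic 0 favours the "animal" entries, topic 1 the "finance" entries.
topics = [[5, 5, 1, 1], [1, 1, 5, 5]]
en_doc = sample_document([0.9, 0.1], topics, lang=0, length=6, rng=rng)
es_doc = sample_document([0.9, 0.1], topics, lang=1, length=6, rng=rng)
# Both documents are drawn from the same multilingual topics, so the
# model relates them without requiring any document alignment.
```

Because the two documents share one set of topics, related words across languages (e.g. "dog" and "perro") fall under the same topic, which is what allows such a model to fit the joint corpus with fewer topics and to predict related words in the other language.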