Mining multilingual topics from Wikipedia

Authors:
Xiaochuan Ni; Jian-Tao Sun; Jian Hu; Zheng Chen
Affiliations:
Microsoft Research Asia, Beijing, China (all authors)
Venue:
Proceedings of the 18th international conference on World wide web
Year:
2009

Abstract

In this paper, we leverage Wikipedia, a large-scale multilingual knowledge base, to help analyze and organize Web information written in different languages. Based on the observation that one Wikipedia concept may be described by articles in different languages, we adapt an existing topic modeling algorithm to mine multilingual topics from this knowledge base. The extracted 'universal' topics have multiple types of representations, each corresponding to one language. Accordingly, new documents in different languages can be represented in a single space defined by a group of universal topics, which makes various multilingual Web applications feasible.
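The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not name the topic model or the inference method, so the code below assumes an LDA-style model trained with collapsed Gibbs sampling, in which all language versions of one Wikipedia concept share a single topic distribution while each language keeps its own topic-word counts. The function names (`train_multilingual_topics`, `infer_topics`), the hyperparameters, and the toy English/German data are all hypothetical.

```python
import random

def train_multilingual_topics(concepts, num_topics, iters=100,
                              alpha=0.1, beta=0.01, seed=0):
    """concepts: list of {language: [tokens]}; one dict per Wikipedia concept.
    Assumption: every language version of a concept shares one topic mixture."""
    rng = random.Random(seed)
    langs = sorted({lang for c in concepts for lang in c})
    vocab = {l: sorted({w for c in concepts for w in c.get(l, [])}) for l in langs}
    concept_topic = [[0] * num_topics for _ in concepts]
    topic_word = {l: [dict.fromkeys(vocab[l], 0) for _ in range(num_topics)]
                  for l in langs}
    topic_total = {l: [0] * num_topics for l in langs}
    assignments = []  # one [concept index, language, word, topic] per token
    for ci, concept in enumerate(concepts):
        for lang, tokens in concept.items():
            for w in tokens:
                z = rng.randrange(num_topics)
                assignments.append([ci, lang, w, z])
                concept_topic[ci][z] += 1
                topic_word[lang][z][w] += 1
                topic_total[lang][z] += 1
    for _ in range(iters):
        for a in assignments:
            ci, lang, w, z = a
            # Remove the token's current assignment before resampling it.
            concept_topic[ci][z] -= 1
            topic_word[lang][z][w] -= 1
            topic_total[lang][z] -= 1
            v = len(vocab[lang])
            weights = [(concept_topic[ci][k] + alpha)
                       * (topic_word[lang][k][w] + beta)
                       / (topic_total[lang][k] + beta * v)
                       for k in range(num_topics)]
            z = rng.choices(range(num_topics), weights=weights)[0]
            a[3] = z
            concept_topic[ci][z] += 1
            topic_word[lang][z][w] += 1
            topic_total[lang][z] += 1
    return {"topic_word": topic_word, "topic_total": topic_total,
            "vocab": vocab, "num_topics": num_topics, "beta": beta}

def infer_topics(tokens, lang, model, alpha=0.1, steps=50):
    """Project a new single-language document onto the shared topics
    using a simple fixed-point (EM-style) update with topics held fixed."""
    k_range = range(model["num_topics"])
    beta, v = model["beta"], len(model["vocab"][lang])

    def word_prob(k, w):  # smoothed P(word | topic k) in this language
        count = model["topic_word"][lang][k].get(w, 0)
        return (count + beta) / (model["topic_total"][lang][k] + beta * v)

    theta = [1.0 / model["num_topics"]] * model["num_topics"]
    for _ in range(steps):
        counts = [alpha] * model["num_topics"]
        for w in tokens:
            p = [theta[k] * word_prob(k, w) for k in k_range]
            s = sum(p)
            for k in k_range:
                counts[k] += p[k] / s
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta

# Toy stand-in for concepts linked across languages (hypothetical data).
concepts = [
    {"en": ["football", "goal", "team"], "de": ["fussball", "tor", "mannschaft"]},
    {"en": ["team", "goal", "league"], "de": ["mannschaft", "tor", "liga"]},
    {"en": ["guitar", "song", "band"], "de": ["gitarre", "lied", "band"]},
    {"en": ["song", "band", "album"], "de": ["lied", "band", "album"]},
]
model = train_multilingual_topics(concepts, num_topics=2)
theta_en = infer_topics(["football", "league"], "en", model)
theta_de = infer_topics(["fussball", "liga"], "de", model)
```

Because `theta_en` and `theta_de` are distributions over the same topic indices, documents in different languages become directly comparable, which is the property the abstract relies on. On real data the concepts would come from Wikipedia articles linked across languages, and the sampler would need far more iterations than this toy setting.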