Cross lingual text classification by mining multilingual topics from Wikipedia

Authors:
Xiaochuan Ni;Jian-Tao Sun;Jian Hu;Zheng Chen
Affiliations:
Microsoft Research Asia, Beijing, China;Microsoft Research Asia, Beijing, China;Tencent Soso, Beijing, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the fourth ACM international conference on Web search and data mining
Year:
2011

Abstract

This paper investigates how to perform cross-lingual text classification effectively by leveraging Wikipedia, a large-scale multilingual knowledge base. Based on the observation that each Wikipedia concept is described by documents in different languages, we adapt existing topic modeling algorithms to mine multilingual topics from this knowledge base. Each extracted topic has multiple representations, one per language. We regard the topics extracted from Wikipedia documents as universal-topics, since each topic carries the same semantic information across languages. New documents in different languages can therefore be represented in a single space spanned by a group of universal-topics. We use these universal-topics for cross-lingual text classification: given training data labeled in one language, we train a text classifier that classifies documents in another language by mapping the documents of both languages into the universal-topic space. This approach requires no additional linguistic resources, such as bilingual dictionaries, machine translation tools, or labeled data for the target language. The evaluation results indicate that our topic modeling approach is effective for building cross-lingual text classifiers.
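
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The topic-word tables in `TOPICS` are hand-written toy values standing in for topics mined from Wikipedia, the English/Spanish language pair is chosen only for illustration, the weighted word-count mapping in `to_topic_space` is a simple substitute for proper topic model inference, and the nearest-centroid classifier is an assumed stand-in for whatever classifier the paper uses. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical universal topics: topic k has one word distribution per language.
# These toy values stand in for topics that would be mined from Wikipedia.
TOPICS = {
    "en": [
        {"football": 0.4, "match": 0.3, "team": 0.3},
        {"bank": 0.4, "market": 0.3, "stock": 0.3},
    ],
    "es": [
        {"futbol": 0.4, "partido": 0.3, "equipo": 0.3},
        {"banco": 0.4, "mercado": 0.3, "bolsa": 0.3},
    ],
}


def to_topic_space(text, lang, smoothing=1e-3):
    """Map a document to a normalized vector over the shared topics.

    Assumption: a weighted word-count score replaces real topic inference.
    """
    counts = Counter(text.lower().split())
    scores = [
        smoothing + sum(n * topic.get(word, 0.0) for word, n in counts.items())
        for topic in TOPICS[lang]
    ]
    total = sum(scores)
    return [s / total for s in scores]


def train_centroids(labeled_docs, lang):
    """Average the topic vectors of the labeled documents, per class."""
    sums, counts = {}, Counter()
    for text, label in labeled_docs:
        vec = to_topic_space(text, lang)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(text, lang, centroids):
    """Assign the label whose centroid is closest in topic space."""
    vec = to_topic_space(text, lang)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Train on labeled English documents only.
english_train = [
    ("the team won the football match", "sports"),
    ("the stock market and the bank", "finance"),
]
centroids = train_centroids(english_train, "en")

# Classify Spanish documents without any Spanish labels or translation.
print(classify("el equipo gana el partido de futbol", "es", centroids))
print(classify("el banco y la bolsa", "es", centroids))
```

Because both languages are projected onto the same topic indices, the classifier never sees words directly, which is why no bilingual dictionary or translation step appears in the sketch.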