A co-classification approach to learning from multilingual corpora

Authors:
Massih-Reza Amini; Cyril Goutte
Affiliations:
Interactive Language Technologies group, National Research Council Canada, Gatineau, Canada J8X 3X7;Interactive Language Technologies group, National Research Council Canada, Gatineau, Canada J8X 3X7
Venue:
Machine Learning
Year:
2010

Citing 0
Cited 7

Combining coregularization and consensus-based self-training for multilingual text categorization
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval

Knowledge transfer across multilingual corpora via latent topics
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I

Bilingual co-training for sentiment classification of Chinese product reviews
Computational Linguistics

Integration of three visualization methods based on co-word analysis
Scientometrics

Fast on-line learning for multilingual categorization
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval

Improving document clustering using automated machine translation
Proceedings of the 21st ACM international conference on Information and knowledge management

A scatter method for data and variable importance evaluation
Integrated Computer-Aided Engineering


Abstract

We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview learning approach based on co-regularization, in which we treat each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while enforcing consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, most of the time, even models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
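
The joint loss described above can be illustrated with a minimal sketch for two languages and binary labels. This is not the authors' formulation: the abstract does not specify the consistency term, so a squared difference between the two views' predicted probabilities is assumed here. The function name `co_regularized_loss`, the weight `lam`, and the dense feature vectors are all hypothetical.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(p, y):
    # Binary cross-entropy for one example, with y in {0, 1}.
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def co_regularized_loss(w1, w2, docs1, docs2, labels, lam=1.0):
    """Average joint loss over documents available in two languages.

    w1, w2       : weight vectors of the two monolingual logistic classifiers
    docs1, docs2 : feature vectors of the same documents in language 1 and 2
    labels       : shared binary labels
    lam          : weight of the cross-language disagreement penalty
                   (assumed form: squared difference of probabilities)
    """
    total = 0.0
    for x1, x2, y in zip(docs1, docs2, labels):
        p1 = sigmoid(dot(w1, x1))
        p2 = sigmoid(dot(w2, x2))
        # Monolingual classification losses, one per language view.
        total += log_loss(p1, y) + log_loss(p2, y)
        # Consistency term: penalize views that disagree on the same document.
        total += lam * (p1 - p2) ** 2
    return total / len(labels)
```

Setting `lam=0` reduces this to two independently trained monolingual classifiers, which is the baseline the abstract compares against; a positive `lam` couples the two views through their disagreement.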