Combining coregularization and consensus-based self-training for multilingual text categorization
Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval
We address the problem of learning text categorization from a corpus of multilingual documents. We propose a multiview learning, co-regularization approach in which we consider each language as a separate source and minimize a joint loss that combines the monolingual classification losses in each language while enforcing consistency of the categorization across languages. We derive training algorithms for logistic regression and boosting, and show that the resulting categorizers outperform models trained independently on each language and, in most cases, even models trained on the joint bilingual data. Experiments are carried out on a multilingual extension of the RCV2 corpus, which is available for benchmarking.
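The co-regularized objective described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual formulation: it assumes two languages as parallel views of the same labeled documents, logistic-regression scorers in each view, and a squared-difference penalty as the cross-language consistency term; the function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_loss(w_en, w_fr, X_en, X_fr, y, lam):
    """Co-regularization sketch: sum of monolingual logistic losses
    plus a penalty on cross-language disagreement.

    X_en, X_fr : feature matrices for the two language views of the
                 same documents (different vocabularies, same rows).
    y          : shared binary category labels.
    lam        : weight of the consistency (co-regularization) term.
    """
    p_en = sigmoid(X_en @ w_en)
    p_fr = sigmoid(X_fr @ w_fr)
    eps = 1e-12  # numerical guard for log
    loss_en = -np.mean(y * np.log(p_en + eps) + (1 - y) * np.log(1 - p_en + eps))
    loss_fr = -np.mean(y * np.log(p_fr + eps) + (1 - y) * np.log(1 - p_fr + eps))
    disagreement = np.mean((p_en - p_fr) ** 2)  # consistency across languages
    return loss_en + loss_fr + lam * disagreement
```

Minimizing this joint loss (e.g. by gradient descent on `w_en` and `w_fr`) couples the two monolingual classifiers: each is trained on its own language, but the `lam`-weighted term pushes their predictions on the same documents to agree.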