Cross Language Text Categorization (CLTC) is the task of assigning class labels to documents written in a target language (e.g. Chinese) when the system is trained on labeled examples in a source language (e.g. English). With CLTC, we can build classifiers for multiple languages using existing training data in only one language, thereby avoiding the cost of preparing training data for each individual language. One challenge for CLTC is the cultural difference between languages, which causes a classifier trained on the source language to perform poorly on the target language. In this paper, we propose an active learning algorithm for CLTC that takes full advantage of both the labeled data in the source language and the unlabeled data in the target language. The classifier first learns classification knowledge from the source language, and then learns culture-dependent knowledge from the target language. In addition, we extend the algorithm to a two-view form by treating the source and target languages as two views of the classification problem. Experiments show that our algorithm effectively improves cross language classification performance.
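The abstract does not spell out the algorithm's details, but the general shape it describes (train on source-language labels, then query the most uncertain target-language documents) is pool-based active learning with uncertainty sampling. The following is a minimal sketch under that assumption; the function names (`active_learn`, `oracle`) are hypothetical, and a simple nearest-centroid model stands in for the SVM typically used in this line of work:

```python
import math

def train_centroid(labeled):
    """Nearest-centroid 'classifier': one mean feature vector per class.
    (A lightweight stand-in for an SVM; illustration only.)"""
    sums, counts = {}, {}
    for x, y in labeled:
        sums.setdefault(y, [0.0] * len(x))
        counts[y] = counts.get(y, 0) + 1
        sums[y] = [s + v for s, v in zip(sums[y], x)]
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def score(centroids, x):
    """Return (predicted_label, margin). The margin is the gap between the
    two nearest centroids; a small margin means high uncertainty."""
    dists = sorted((math.dist(c, x), y) for y, c in centroids.items())
    best = dists[0][1]
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return best, margin

def active_learn(source_labeled, target_pool, oracle, budget):
    """Pool-based active learning: start from source-language labels, then
    repeatedly query the label of the most uncertain target document."""
    labeled = list(source_labeled)
    pool = list(target_pool)
    for _ in range(budget):
        if not pool:
            break
        model = train_centroid(labeled)
        # Select the pool document with the smallest margin (most uncertain).
        x = min(pool, key=lambda d: score(model, d)[1])
        pool.remove(x)
        labeled.append((x, oracle(x)))  # ask a human annotator for the label
    return train_centroid(labeled)
```

In the two-view extension sketched in the abstract, one would maintain two such models (one per language view, e.g. over the original and machine-translated features) and let them exchange confidently labeled documents, in the spirit of co-training.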