Using information from the target language to improve crosslingual text classification

Authors:
Gabriela Ramírez-de-la-Rosa;Manuel Montes-y-Gómez;Luis Villaseñor-Pineda;David Pinto-Avendaño;Thamar Solorio
Affiliations:
Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Faculty of Computer Science, Autonomous University of Puebla;Department of Computer and Information Sciences, University of Alabama at Birmingham
Venue:
IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Year:
2010

Abstract

Crosslingual text classification consists of exploiting labeled documents in a source language to classify documents in a different target language. Beyond the evident translation problem, the task is complicated by cultural discrepancies between the two languages, which manifest as different topic distributions. Such discrepancies make a classifier trained on source-language data unreliable for categorizing target-language documents. To tackle this problem, we propose to improve classification performance by using information embedded in the target dataset itself. The central idea of the proposed approach is that similar documents should belong to the same category. It therefore classifies each document by considering not only its own content but also the categories assigned to other similar documents from the same target dataset. Experimental results on three different languages support the appropriateness of the proposed approach.
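
The central idea can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function name `refine_labels`, the use of cosine similarity over raw term counts, the neighborhood size `k`, and the mixing parameter `weight` are all hypothetical. The sketch assumes that a classifier trained on source-language data has already produced a score per category for every target document, and it then blends each document's own scores with the similarity-weighted scores of its nearest neighbors in the same target dataset.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def refine_labels(docs, initial_scores, k=2, weight=0.5):
    """Relabel target documents using their own scores and their neighbors' scores.

    docs           -- list of token lists (target-language documents)
    initial_scores -- list of dicts, category -> score, assumed to come from
                      a classifier trained on source-language data
    k              -- number of most similar documents to consult (assumption)
    weight         -- share of the final score taken from neighbors (assumption)
    """
    vectors = [Counter(tokens) for tokens in docs]
    labels = []
    for i, vec in enumerate(vectors):
        # The k most similar other documents in the same target dataset.
        neighbors = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )[:k]
        total_sim = sum(sim for sim, _ in neighbors)
        combined = {}
        for category, own_score in initial_scores[i].items():
            # Similarity-weighted average of the neighbors' scores.
            neighbor_score = (
                sum(sim * initial_scores[j].get(category, 0.0) for sim, j in neighbors)
                / total_sim
                if total_sim
                else 0.0
            )
            combined[category] = (1 - weight) * own_score + weight * neighbor_score
        labels.append(max(combined, key=combined.get))
    return labels


# Toy data: the third document is initially scored slightly toward "politics",
# although it shares its vocabulary with the two sports documents.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "league"],
    ["match", "team", "goal", "league"],
    ["vote", "election", "party"],
    ["election", "party", "senate"],
]
scores = [
    {"sports": 0.9, "politics": 0.1},
    {"sports": 0.8, "politics": 0.2},
    {"sports": 0.45, "politics": 0.55},
    {"sports": 0.1, "politics": 0.9},
    {"sports": 0.2, "politics": 0.8},
]
labels = refine_labels(docs, scores)
print(labels)
```

In this toy run the third document, which its own scores would place in "politics", ends up in "sports" because its two nearest neighbors are confidently scored as sports. This only mirrors the intuition stated in the abstract and should not be read as the method evaluated in the paper.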