Using Nearest Neighbor Information to Improve Cross-Language Text Classification

Authors:
Adelina Escobar-Acevedo;Manuel Montes-Y-Gómez;Luis Villaseñor-Pineda
Affiliations:
Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico;Laboratory of Language Technologies, Department of Computational Sciences, National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico
Venue:
MICAI '09 Proceedings of the 8th Mexican International Conference on Artificial Intelligence
Year:
2009

Abstract

Cross-language text classification (CLTC) aims to take advantage of existing training data in one language to construct a classifier for another language. In addition to the expected translation issues, CLTC is complicated by the cultural distance between the two languages, which causes documents belonging to the same category to concern very different topics. This paper proposes a re-classification method whose purpose is to reduce the errors caused by this phenomenon by considering information from the target-language documents themselves. Experimental results on a news corpus, considering three pairs of languages and four categories, demonstrated the appropriateness of the proposed method, which improved the initial classification accuracy by up to 11%.
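The abstract does not spell out the re-classification procedure, so the following is only a minimal sketch of the general idea suggested by the title: after an initial cross-language classification, each target-language document is re-labelled using the labels of its nearest neighbors among the other target-language documents. Everything here is an assumption made for illustration and not the authors' algorithm: the function names `cosine` and `reclassify`, the bag-of-words cosine similarity, the single majority-vote pass, the choice of `k`, and the toy Spanish documents.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reclassify(docs, initial_labels, k=2):
    """One re-classification pass (hypothetical scheme): each document takes
    the majority label among itself and its k most similar documents, all
    drawn from the same target-language collection."""
    vectors = [Counter(d.lower().split()) for d in docs]
    new_labels = []
    for i, vec in enumerate(vectors):
        # Rank every other target-language document by similarity to doc i.
        neighbors = sorted(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: cosine(vec, vectors[j]),
            reverse=True,
        )[:k]
        # Votes come from the initial labels, so the pass is order-independent.
        votes = Counter([initial_labels[i]] + [initial_labels[j] for j in neighbors])
        top = max(votes.values())
        winners = [label for label, count in votes.items() if count == top]
        # On a tie, keep the document's own initial label when it is a winner.
        own = initial_labels[i]
        new_labels.append(own if own in winners else winners[0])
    return new_labels


# Toy target-language collection; the document at index 2 is deliberately
# given a wrong initial label to mimic a cross-language classification error.
docs = [
    "el equipo gano el partido de futbol",
    "el partido de futbol termino en empate",
    "el equipo de futbol ficho un delantero",
    "el gobierno aprobo la ley de presupuesto",
    "el congreso debatio la ley de presupuesto",
    "el gobierno presento la ley al congreso",
]
initial = ["sports", "sports", "politics", "politics", "politics", "politics"]

result = reclassify(docs, initial, k=2)
print(result)
```

In this toy run the mislabelled football document is pulled back to the label of its two closest neighbors, while the correctly labelled documents keep their labels. A real system would presumably weight the initial classifier's confidence and might iterate; those details are not recoverable from the abstract.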