Using information from the target language to improve crosslingual text classification

  • Authors:
  • Gabriela Ramírez-de-la-Rosa;Manuel Montes-y-Gómez;Luis Villaseñor-Pineda;David Pinto-Avendaño;Thamar Solorio

  • Affiliations:
  • Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Laboratory of Language Technologies, National Institute for Astrophysics, Optics and Electronics;Faculty of Computer Science, Autonomous University of Puebla;Department of Computer and Information Sciences, University of Alabama at Birmingham

  • Venue:
  • IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
  • Year:
  • 2010

Abstract

Crosslingual text classification consists of exploiting labeled documents in a source language to classify documents in a different target language. In addition to the evident translation problem, this task also faces difficulties caused by cultural discrepancies between the two languages, which manifest as different topic distributions. Such discrepancies make a classifier trained on the source language unreliable for categorizing target-language documents. To tackle this problem, we propose to improve classification performance by using information embedded in the target dataset itself. The central idea of the proposed approach is that similar documents must belong to the same category. Therefore, it classifies each document by considering not only its own content but also the categories assigned to other similar documents from the same target dataset. Experimental results on three different languages demonstrate the appropriateness of the proposed approach.
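The central idea, that a document's category should agree with the categories of similar documents in the same target dataset, can be illustrated with a minimal sketch. The abstract does not say how similarity is measured or how the two sources of evidence are combined, so everything below is an assumption made for illustration only: cosine similarity over term counts, a fixed number of neighbours `k`, a linear blend controlled by `weight`, and the toy documents and initial scores. It is not the authors' actual method.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reclassify(docs, scores, k=2, weight=0.5):
    """Blend each document's own class scores with those of its k most similar documents.

    docs:   list of token lists (target-language documents)
    scores: list of {category: score} dicts from an initial classifier
            (assumed to be trained on source-language data)
    k, weight: hypothetical parameters, not taken from the paper
    """
    vectors = [Counter(d) for d in docs]
    labels = []
    for i, own in enumerate(scores):
        # Rank every other target document by similarity to document i.
        sims = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in range(len(docs)) if j != i),
            reverse=True,
        )[:k]
        total = sum(s for s, _ in sims)
        combined = {}
        for cat in own:
            # Similarity-weighted average of the neighbours' scores;
            # fall back to the document's own score if it has no similar neighbour.
            neigh = sum(s * scores[j][cat] for s, j in sims) / total if total else own[cat]
            combined[cat] = weight * own[cat] + (1 - weight) * neigh
        labels.append(max(combined, key=combined.get))
    return labels


# Toy data (invented): the third document is initially scored as "politics",
# but its nearest neighbours in the target set are about sports.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "league"],
    ["match", "team", "coach"],
    ["vote", "party", "election"],
    ["election", "party", "minister"],
]
scores = [
    {"sports": 0.9, "politics": 0.1},
    {"sports": 0.8, "politics": 0.2},
    {"sports": 0.4, "politics": 0.6},
    {"sports": 0.1, "politics": 0.9},
    {"sports": 0.2, "politics": 0.8},
]
labels = reclassify(docs, scores)
print(labels)
```

In this toy run the third document switches from "politics" to "sports" because its two most similar documents carry strong sports scores, while the other four labels stay unchanged. A real system would need to choose the similarity measure, neighbourhood size, and blending weight empirically.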