Evaluation of normalization techniques in text classification for Portuguese

  • Authors:
  • Merley da Silva Conrado; Víctor Antonio Laguna Gutiérrez; Solange Oliveira Rezende

  • Affiliations:
  • Sao Paulo University (USP), Sao Carlos, SP, Brazil; Pontifical Catholic University of Peru (PUCP), Lima, Peru; Sao Paulo University (USP), Sao Carlos, SP, Brazil

  • Venue:
  • ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part III
  • Year:
  • 2012

Abstract

Text classification is an important task in Artificial Intelligence. It typically relies on large textual datasets, whose representation is made feasible by normalization and selection techniques. The literature describes three normalization techniques: stemming, lemmatization, and nominalization. However, it is difficult to choose the most suitable technique for a given text classification task. In this paper, we investigate this question experimentally by applying five different classifiers to four textual datasets in Portuguese. The classification results are also evaluated using unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, two criteria can guide the choice of technique: the number of terms obtained in each case and the comprehensibility required of the classification results.
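
To make the "number of terms" criterion concrete, the following is a minimal sketch of how a normalization step changes the vocabulary size under unigram, bigram, and combined representations. It is not the authors' pipeline: `toy_stem`, `TOY_SUFFIXES`, `term_counts`, and the sample sentences are hypothetical names and data introduced only for illustration, and the suffix stripper is a deliberately crude stand-in for a real Portuguese stemmer, lemmatizer, or nominalizer.

```python
from collections import Counter

# Hypothetical toy suffix list: an assumption for illustration only,
# NOT a real Portuguese stemming algorithm.
TOY_SUFFIXES = ("mente", "es", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return the list of space-joined n-grams over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_counts(docs, normalize, use_unigrams=True, use_bigrams=False):
    """Count terms over all documents for the chosen representation."""
    counts = Counter()
    for doc in docs:
        tokens = [normalize(t) for t in doc.lower().split()]
        if use_unigrams:
            counts.update(ngrams(tokens, 1))
        if use_bigrams:
            counts.update(ngrams(tokens, 2))
    return counts


def identity(token):
    """No normalization: the baseline to compare against."""
    return token


if __name__ == "__main__":
    # Hypothetical sample documents, not taken from the paper's datasets.
    docs = ["os gatos correm rapidamente", "o gato corre", "os textos longos"]
    settings = {
        "unigrams": (True, False),
        "bigrams": (False, True),
        "unigrams+bigrams": (True, True),
    }
    for name, (uni, bi) in settings.items():
        raw = len(term_counts(docs, identity, uni, bi))
        stemmed = len(term_counts(docs, toy_stem, uni, bi))
        print(f"{name}: {raw} terms raw, {stemmed} terms after toy stemming")
```

Swapping `toy_stem` for a real stemmer or lemmatizer would give a like-for-like comparison of the resulting term counts. The truncated forms this toy stemmer produces (for example "rapida") also hint at why the comprehensibility of the terms matters when choosing a technique.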