Evaluation of normalization techniques in text classification for Portuguese

Authors:
Merley da Silva Conrado;Víctor Antonio Laguna Gutiérrez;Solange Oliveira Rezende
Affiliations:
Sao Paulo University (USP), Sao Carlos, SP, Brazil;Pontifical Catholic University of Peru (PUCP), Lima, Peru;Sao Paulo University (USP), Sao Carlos, SP, Brazil
Venue:
ICCSA'12 Proceedings of the 12th international conference on Computational Science and Its Applications - Volume Part III
Year:
2012

Abstract

Text classification is an important task in Artificial Intelligence. It typically involves large textual datasets, whose representation is made feasible by term normalization and selection techniques. The literature describes three normalization techniques: stemming, lemmatization, and nominalization. However, it is difficult to choose the most suitable one for the text classification task. In this paper, we investigate this question experimentally by applying five different classifiers to four textual datasets in Portuguese. The classification results are evaluated with three term representations: unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, two criteria can guide the choice of normalization technique: the number of terms obtained in each case and the comprehensibility required of the classification results.
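
As an illustration of the "number of terms" criterion, the sketch below counts the distinct terms produced by unigram, bigram, and combined representations, with and without normalization. It is not the authors' pipeline: `toy_normalize` is a hypothetical stand-in that only strips a final "s" (it is not a stemmer, lemmatizer, or nominalizer), and the three example documents are invented.

```python
from typing import Callable, Iterable, List, Set


def toy_normalize(token: str) -> str:
    """Hypothetical normalizer: lowercase and strip a final 's' from longer tokens."""
    token = token.lower()
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vocabulary(docs: Iterable[str],
               normalize: Callable[[str], str],
               sizes: Iterable[int]) -> Set[str]:
    """Collect the distinct terms over all documents for the given n-gram sizes."""
    sizes = tuple(sizes)
    vocab: Set[str] = set()
    for doc in docs:
        tokens = [normalize(t) for t in doc.split()]
        for n in sizes:
            vocab.update(ngrams(tokens, n))
    return vocab


# Invented example documents (assumption, not data from the paper).
docs = ["os gatos pretos", "o gato preto", "gatos e gato"]

for label, sizes in [("unigrams", (1,)), ("bigrams", (2,)), ("both", (1, 2))]:
    raw = len(vocabulary(docs, str.lower, sizes))
    norm = len(vocabulary(docs, toy_normalize, sizes))
    print(f"{label}: {raw} terms without normalization, {norm} with")
```

In this toy setting, normalization merges inflected forms and shrinks the term set, while adding bigrams enlarges it. That trade-off is what the term-count criterion weighs.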