Text classification is an important task in Artificial Intelligence. It typically involves large textual datasets, whose representation is feasible only because of normalization and selection techniques. The literature describes three normalization techniques: stemming, lemmatization, and nominalization. It is nevertheless difficult to choose the most suitable of these for a given text classification task. In this paper, we investigate the question experimentally by applying five different classifiers to four textual datasets in the Portuguese language. We also evaluate the classification results using unigrams, bigrams, and the combination of unigrams and bigrams. The results indicate that, in general, two criteria can guide the choice of normalization technique: the number of terms that each technique produces and the degree of comprehensibility required of the classification results.
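To make the three term representations concrete, here is a minimal Python sketch of how unigram, bigram, and combined term sets might be built, and how normalization changes the number of terms. It is an illustration only and not the paper's pipeline: the function names (`toy_normalize`, `extract_terms`, `vocabulary_size`), the suffix list, and the two sample documents are all assumptions, and `toy_normalize` is a crude stand-in for a real stemmer, lemmatizer, or nominalizer.

```python
from collections import Counter


def toy_normalize(token):
    """Hypothetical stand-in for a real normalizer: strips a few Portuguese suffixes."""
    for suffix in ("ções", "ção", "mente", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_terms(text, mode="unigram", normalize=None):
    """Count the terms of one document.

    mode is "unigram", "bigram", or "both" (unigrams plus bigrams).
    normalize is an optional function applied to each token first.
    """
    tokens = text.lower().split()
    if normalize is not None:
        tokens = [normalize(t) for t in tokens]
    terms = []
    if mode in ("unigram", "both"):
        terms += ngrams(tokens, 1)
    if mode in ("bigram", "both"):
        terms += ngrams(tokens, 2)
    return Counter(terms)


def vocabulary_size(docs, mode, normalize=None):
    """Number of distinct terms across all documents."""
    vocabulary = set()
    for doc in docs:
        vocabulary.update(extract_terms(doc, mode, normalize))
    return len(vocabulary)


# Made-up sample documents, used only for illustration.
docs = ["classificação de textos", "classificações de texto"]

for mode in ("unigram", "bigram", "both"):
    raw = vocabulary_size(docs, mode)
    normalized = vocabulary_size(docs, mode, toy_normalize)
    print(f"{mode}: {raw} terms raw, {normalized} terms normalized")
```

On these two sample documents, normalization merges inflected forms and so shrinks every vocabulary, while adding bigrams enlarges it. This is the kind of term-count difference the abstract names as one criterion for choosing a technique.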