The nature of statistical learning theory
Machine learning in automated text categorization
ACM Computing Surveys (CSUR)
Machine Learning
Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning
Support Vector Machines for Text Categorization
HICSS '03 Proceedings of the 36th Annual Hawaii International Conference on System Sciences (HICSS'03) - Track 4 - Volume 4
An extensive empirical study of feature selection metrics for text classification
The Journal of Machine Learning Research
Information Theory, Inference & Learning Algorithms
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Automatic Arabic document categorization based on the Naïve Bayes algorithm
Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
The Automatic Categorization of Arabic Documents by Boosting Decision Trees
SITIS '09 Proceedings of the 2009 Fifth International Conference on Signal Image Technology and Internet Based Systems
Arabic is a highly inflectional and morphologically rich language, and it poses serious challenges to the automatic classification of documents. One of these challenges is determining which type of feature yields the best classification results. Some researchers use roots or lemmas, arguing that they handle inflectional variation that does not arise in the same way in other languages. Others prefer character-level n-grams, since n-grams are simpler to implement, language independent, and produce satisfactory results. Which of these two approaches is better, if either? This paper addresses this question through a comparative study of four feature types (words in their original form, lemmas, roots, and character-level n-grams) and shows how each affects classifier performance. We used and compared the performance of the Support Vector Machine and Naïve Bayes algorithms.
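To make the contrast between the feature types concrete, the two simplest ones (surface words and character-level n-grams) can be sketched as below. This is a minimal illustration, not the paper's pipeline; the function names and the boundary-marking convention are assumptions, and the lemma/root features would additionally require an Arabic morphological analyzer.

```python
# Illustrative sketch (not the authors' implementation): extracting
# whole-word features versus character-level n-gram features from text.

def word_features(text):
    """Words in their original (surface) form, split on whitespace."""
    return text.split()

def char_ngrams(text, n=3):
    """Overlapping character-level n-grams; '_' marks word boundaries."""
    text = text.replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sample = "التصنيف الآلي"        # "automatic categorization"
print(word_features(sample))     # surface-word features
print(char_ngrams(sample, n=3))  # character trigram features
```

Either feature list would then be turned into count vectors and fed to the SVM or Naïve Bayes classifier; n-grams need no language-specific preprocessing, which is the simplicity argument the abstract mentions.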