Automatic Arabic document categorization based on the Naïve Bayes algorithm

Authors:
Mohamed El Kourdi;Amine Bensaid;Tajje-eddine Rachidi
Affiliations:
Alakhawayn University, Ifrane, Morocco;Alakhawayn University, Ifrane, Morocco;Alakhawayn University, Ifrane, Morocco
Venue:
Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
Year:
2004

Abstract

This paper addresses the automatic classification of Arabic web documents. Such classification is useful for providing directory search functionality, which many web portals and search engines have used to cope with the ever-increasing number of documents on the web. Naive Bayes (NB), a statistical machine learning algorithm, is used to classify non-vocalized Arabic web documents into one of five pre-defined categories after their words have been reduced to their canonical forms, i.e., roots. Cross-validation experiments are used to evaluate the NB categorizer. The data set used in these experiments consists of 300 web documents per category. The leave-one-out cross-validation results show that, using 2,000 terms/roots, categorization accuracy varies from one category to another, with an average accuracy over all categories of 68.78%. The best per-category performance in the cross-validation experiments reaches 92.8%. Further tests on a manually collected evaluation set of 10 documents from each of the 5 categories show an overall classification accuracy of 62%, with the best per-category result reaching 90%.
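
To make the evaluation procedure in the abstract concrete, the following is a minimal sketch of a Naive Bayes categorizer scored by leave-one-out cross-validation. It is not the authors' implementation: the multinomial event model, the add-one (Laplace) smoothing, the function names, and the toy corpus of transliterated placeholder "roots" are all assumptions made for illustration. Root extraction and the restriction to 2,000 terms are omitted; documents are assumed to arrive already reduced to lists of roots.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_docs = Counter(labels)        # number of documents per category
    term_counts = defaultdict(Counter)  # per-category root frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def predict_nb(model, tokens):
    """Return the category with the highest log posterior score."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total = sum(term_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for t in tokens:
            if t in vocab:  # roots never seen in training are ignored
                # Add-one smoothing (an assumption of this sketch)
                score += math.log(
                    (term_counts[label][t] + 1) / (total + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def leave_one_out_accuracy(docs, labels):
    """Hold out each document once, train on the rest, and score it."""
    correct = 0
    for i in range(len(docs)):
        model = train_nb(docs[:i] + docs[i + 1:], labels[:i] + labels[i + 1:])
        correct += predict_nb(model, docs[i]) == labels[i]
    return correct / len(docs)


# Hypothetical toy corpus: each document is a list of placeholder roots.
toy_docs = [
    ["krh", "lab", "fwz"],
    ["lab", "krh", "hdf"],
    ["fwz", "hdf", "krh"],
    ["mal", "swq", "bnk"],
    ["bnk", "mal", "rbh"],
    ["swq", "rbh", "mal"],
]
toy_labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

print(leave_one_out_accuracy(toy_docs, toy_labels))
```

With a real corpus, per-category accuracy can be obtained by tallying correct predictions separately for each label inside the leave-one-out loop, which corresponds to the per-category figures the abstract reports.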