Using some web content mining techniques for Arabic text classification

  • Authors:
  • Zakaria Suliman Zubi

  • Affiliations:
  • Computer Science Department, Faculty of Science, Al-Tahadi University, Sirt, Libya

  • Venue:
  • DNCOCO'09 Proceedings of the 8th WSEAS international conference on Data networks, communications, computers
  • Year:
  • 2009

Abstract

With the massive rise in the volume of information available on the World Wide Web, and the emerging need for better techniques to access this information, there has been a strong resurgence of interest in web mining research. Web mining applies data mining and other information processing techniques to the World Wide Web to discover useful patterns. People can take advantage of these patterns to access the World Wide Web more efficiently. Web mining can be divided into three categories: content mining, usage mining, and structure mining. In this paper we apply web content mining to extract non-English knowledge from the web. We investigate and evaluate some common methods, using web mining systems that have to deal with issues in language-specific text processing. A language-independent algorithm is applied to Arabic as the machine learning system. The algorithm uses a set of features, represented as a vector of keywords, in the learning process to perform text classification. Such an algorithm is typically used to classify documents written in a non-English language. It categorizes and classifies the documents with two classifiers: Classifier K-Nearest Neighbor (CK-NN) and Classifier Naïve Bayes (CNB). However, these algorithms usually depend on phrase segmentation and extraction programs to generate the set of features or keywords that represent the retrieved web documents. The proposed Arabic text classification system is called Arabic Text Classifier (ATC). The main goal of ATC is to compare the results of the two classifiers (CK-NN, CNB) and select the one with the best average accuracy before starting the retrieval process. This paper introduces the theory behind ATC without demonstrating a practical implementation of the system.
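
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the ATC system, which the abstract says was not demonstrated in practice. It is a small standard-library Python example built on several assumptions: whitespace splitting stands in for the phrase segmentation and extraction step, cosine similarity over keyword counts is the k-NN distance, the Naïve Bayes variant is multinomial with add-one smoothing, and the function names (`knn_predict`, `nb_predict`, `compare`) and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: whitespace splitting stands in for the phrase
    # segmentation and extraction programs mentioned in the abstract.
    return text.split()

def cosine(a, b):
    # Cosine similarity between two keyword-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(train, text, k=3):
    # Majority vote among the k training documents most similar to the query.
    query = Counter(tokenize(text))
    ranked = sorted(train,
                    key=lambda d: cosine(query, Counter(tokenize(d[0]))),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def nb_train(train):
    # Collect per-class document counts, per-class keyword counts, vocabulary.
    class_docs, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for text, label in train:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab

def nb_predict(model, text):
    # Multinomial Naive Bayes with add-one (Laplace) smoothing, in log space.
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # keywords unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(predict, test):
    return sum(predict(text) == label for text, label in test) / len(test)

def compare(train, test, k=3):
    # Score both classifiers on the same test set and pick the more accurate.
    model = nb_train(train)
    scores = {
        "knn": accuracy(lambda t: knn_predict(train, t, k), test),
        "nb": accuracy(lambda t: nb_predict(model, t), test),
    }
    return scores, max(scores, key=scores.get)

# Hypothetical toy corpus: (keywords, category) pairs, for illustration only.
TRAIN = [
    ("كرة مباراة فريق", "sport"),
    ("لاعب فريق مباراة", "sport"),
    ("كرة لاعب", "sport"),
    ("بنك سوق أسعار", "economy"),
    ("اقتصاد بنك", "economy"),
    ("سوق اقتصاد أسعار", "economy"),
]
TEST = [("مباراة كرة", "sport"), ("أسعار بنك", "economy")]

if __name__ == "__main__":
    scores, best = compare(TRAIN, TEST)
    print(scores, best)
```

On a realistic corpus, the accuracies would be averaged over several held-out splits before one classifier is selected, since a single small test set like the one above cannot separate the two methods reliably.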