The hybrid representation model for web document classification

Authors:
A. Markov; M. Last; A. Kandel
Affiliations:
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel; Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel; Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620
Venue:
International Journal of Intelligent Systems
Year:
2008

Citing 0
Cited 4

Use of Medical Subject Headings (MeSH) in Portuguese for categorizing web-based healthcare content
Journal of Biomedical Informatics

A tool for link-based web page classification
CAEPIA'11 Proceedings of the 14th international conference on Advances in artificial intelligence: Spanish Association for Artificial Intelligence

Intelligent web navigation
FDIA'09 Proceedings of the Third BCS-IRSG conference on Future Directions in Information Access

CoBAn: A context based model for data leakage prevention
Information Sciences: an International Journal

Abstract

Most Web content categorization methods are based on the vector space model of information retrieval. One of the most important advantages of this representation model is that it can be used by both instance-based and model-based classifiers. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the markup information that can easily be extracted from the HTML tags of a Web document. A recently developed graph-based Web document representation model can preserve Web document structural information. It was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this article, three new hybrid approaches to Web document classification are presented, built upon both graph and vector space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using the C4.5 decision tree and the probabilistic Naïve Bayes classifiers on several benchmark Web document collections. The results demonstrate that the hybrid methods presented in this article outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time. © 2008 Wiley Periodicals, Inc.
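
The contrast the abstract draws between the two representations can be illustrated with a small sketch. This is not the authors' method: the abstract does not specify how the hybrid features are built, so the example assumes, purely for illustration, that a document graph is the set of directed edges between adjacent words and that a hybrid vector is a Boolean indicator of which edges a document contains. The function names (`term_vector`, `word_graph`, `hybrid_vector`) are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def term_vector(tokens):
    """Vector space view: term frequencies only, word order is lost."""
    return Counter(tokens)


def word_graph(tokens):
    """Graph view (assumed form): directed edges between adjacent words."""
    return {(a, b) for a, b in zip(tokens, tokens[1:])}


def hybrid_vector(graph, feature_edges):
    """Hybrid view (assumed form): 1 if the document graph has the edge, else 0."""
    return [1 if edge in graph else 0 for edge in feature_edges]


docs = ["web document classification", "document web classification"]
tokens = [tokenize(d) for d in docs]

vectors = [term_vector(t) for t in tokens]
graphs = [word_graph(t) for t in tokens]

# Use every edge seen in the collection as a feature, in a fixed order.
feature_edges = sorted(set().union(*graphs))
hybrid = [hybrid_vector(g, feature_edges) for g in graphs]

print(vectors[0] == vectors[1])  # the two term vectors are identical
print(hybrid)                    # the two hybrid vectors differ
```

The two documents contain the same words, so their term vectors coincide, while the edge-based features keep them apart. Because the hybrid output is an ordinary fixed-length vector, it could in principle be passed to a model-based classifier such as a decision tree or Naïve Bayes, which is the motivation the abstract gives for combining the two representations.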