A simple, structure-sensitive approach for web document classification

Authors:
Alex Markov;Mark Last
Affiliations:
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel;Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Venue:
AWIC'05 Proceedings of the Third international conference on Advances in Web Intelligence
Year:
2005

Abstract

In this paper we describe a new approach to the classification of web documents. Most web classification methods are based on the vector space document representation used in information retrieval. Recently, a graph-based web document representation model was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. Here we suggest a new hybrid approach to web document classification built upon both graph and vector representations. The k-NN algorithm and three benchmark document collections were used to compare this method with the graph-based and vector-based methods separately. The results demonstrate that in most cases the hybrid approach outperforms the graph and vector approaches in classification accuracy, while also significantly reducing classification time.
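
The abstract does not spell out how the graph and vector representations are combined, so the following is only a minimal sketch of one way such a hybrid could look, not the authors' method. It assumes that each document is turned into a simple word graph (one edge per pair of adjacent words), that edges shared by several training documents serve as binary vector features, and that k-NN uses a Jaccard distance over those vectors. All function names, the `min_docs` threshold, and the toy documents are hypothetical.

```python
from collections import Counter


def doc_to_edges(text):
    """Assumed graph model: one directed edge per pair of adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def build_vocabulary(edge_sets, min_docs=2):
    """Keep edges that occur in at least `min_docs` training documents."""
    counts = Counter(edge for edges in edge_sets for edge in edges)
    return sorted(edge for edge, n in counts.items() if n >= min_docs)


def to_binary_vector(edges, vocabulary):
    """Vector part of the hybrid: 1 if the document graph contains the edge."""
    return [1 if edge in edges else 0 for edge in vocabulary]


def jaccard_distance(a, b):
    """Distance between two binary vectors (1.0 when nothing is shared)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - (both / either if either else 0.0)


def knn_classify(query, train_vectors, train_labels, k=3):
    """Majority vote among the k training vectors closest to `query`."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: jaccard_distance(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
train_texts = [
    "stock market prices rise",
    "stock market prices fall",
    "football match final score",
    "football match final result",
]
train_labels = ["finance", "finance", "sports", "sports"]

train_edges = [doc_to_edges(t) for t in train_texts]
vocabulary = build_vocabulary(train_edges)
train_vectors = [to_binary_vector(e, vocabulary) for e in train_edges]

query_vector = to_binary_vector(
    doc_to_edges("the stock market prices today"), vocabulary
)
print(knn_classify(query_vector, train_vectors, train_labels, k=3))
```

In this sketch the distance computation runs over short fixed-length vectors rather than over whole graphs, which is one plausible reason a hybrid scheme could classify faster than pure graph matching. The abstract itself reports the speed-up without explaining its source.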