Classification of Web Documents Using a Graph-Based Model and Structural Patterns

  • Authors:
  • Andrzej Dominik;Zbigniew Walczak;Jacek Wojciechowski

  • Affiliations:
  • Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland;Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland;Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland

  • Venue:
  • PKDD 2007: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases
  • Year:
  • 2007

Abstract

This paper studies the problem of classifying web documents. Documents are represented with a graph-based model instead of the traditional vector-based one. A novel classification algorithm is proposed that uses two types of structural patterns (subgraphs): contrast and common. The approach is closely related to the classical emerging-pattern techniques known from decision tables. The method is evaluated for classification accuracy on three benchmark web document collections. Results show that it can outperform existing algorithms (based on vector, graph, and hybrid document representations) in terms of accuracy and document model complexity. A further advantage is that the classifier has a simple, understandable structure and can easily be extended with expert knowledge.
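To make the idea of contrast and common patterns concrete, here is a deliberately simplified sketch; it is not the authors' algorithm. It assumes a document graph whose nodes are words and whose directed edges link adjacent words, and it restricts patterns to single edges rather than general subgraphs. The function names, the scoring rule, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def doc_to_graph(text):
    """Assumed model: a document is the set of directed edges between adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def mine_patterns(training):
    """Split each class's edges into contrast and common patterns.

    Simplification: a pattern is a single edge, not an arbitrary subgraph.
    contrast[label]: edges seen in this class and in no other class.
    common[label]:   edges seen in this class and in at least one other class,
                     mapped to their document support inside this class.
    """
    support = defaultdict(Counter)  # label -> edge -> number of documents
    for text, label in training:
        support[label].update(doc_to_graph(text))

    contrast, common = {}, {}
    for label, counts in support.items():
        elsewhere = set()
        for other, other_counts in support.items():
            if other != label:
                elsewhere |= set(other_counts)
        contrast[label] = {e for e in counts if e not in elsewhere}
        common[label] = {e: n for e, n in counts.items() if e in elsewhere}
    return contrast, common


def classify(text, contrast, common):
    """Hypothetical scoring: contrast matches first, common support breaks ties."""
    graph = doc_to_graph(text)
    scores = {}
    for label in contrast:
        contrast_hits = len(graph & contrast[label])
        common_support = sum(n for e, n in common[label].items() if e in graph)
        scores[label] = (contrast_hits, common_support)
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
training = [
    ("stock market prices rise today", "business"),
    ("market prices fall", "business"),
    ("team scores rise today", "sport"),
    ("team wins the match", "sport"),
]
contrast, common = mine_patterns(training)
print(classify("stock market prices", contrast, common))
print(classify("team wins the match", contrast, common))
```

Because each pattern here is an explicit, human-readable edge, the per-class pattern sets can be inspected or edited by hand, which loosely mirrors the abstract's remark that the classifier can be extended with expert knowledge.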