Classification of Web Documents Using a Graph-Based Model and Structural Patterns

Authors:
Andrzej Dominik;Zbigniew Walczak;Jacek Wojciechowski
Affiliations:
Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland;Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland;Warsaw University of Technology, Institute of Radioelectronics, Nowowiejska 15/19, 00-665 Warsaw, Poland
Venue:
PKDD 2007: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases
Year:
2007

Abstract

This paper studies the problem of classifying web documents. Documents are represented with a graph-based model rather than the traditional vector-based one. A novel classification algorithm is proposed that uses two types of structural patterns (subgraphs): contrast patterns and common patterns. The approach is closely related to the classical emerging-pattern techniques known from decision tables. The method is evaluated for classification accuracy on three benchmark collections of web documents. The results show that it can outperform existing algorithms (based on vector, graph, and hybrid document representations) in terms of accuracy and document model complexity. A further advantage is that the classifier has a simple, understandable structure and can easily be extended with expert knowledge.
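
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of classifying with contrast and common subgraphs, not the authors' method. It assumes that each document graph has one node per distinct word, so that a subgraph test reduces to an edge-subset test. The helper names (`doc_graph`, `support`, `mine_patterns`, `classify`), the pattern weights, and the `min_common` threshold are all hypothetical choices made for illustration.

```python
from itertools import combinations


def doc_graph(text):
    """Toy document graph: one node per distinct word, one directed edge
    per pair of adjacent words. Represented as a frozenset of edges."""
    words = text.lower().split()
    return frozenset(zip(words, words[1:]))


def support(pattern, graphs):
    """Fraction of graphs containing `pattern` as a subgraph. With unique
    node labels (an assumption here) this is a plain edge-subset test."""
    if not graphs:
        return 0.0
    return sum(pattern <= g for g in graphs) / len(graphs)


def mine_patterns(train, max_edges=2, min_common=0.5):
    """Collect weighted patterns per class from `train` (label -> graphs).

    Contrast pattern: occurs in the class and in no other class.
    Common pattern: occurs in at least `min_common` of the class.
    The weights are illustrative assumptions. Candidates are small edge
    sets and are not required to be connected.
    """
    model = {}
    for label, graphs in train.items():
        others = [g for other, gs in train.items() if other != label for g in gs]
        candidates = {
            frozenset(edges)
            for g in graphs
            for k in range(1, max_edges + 1)
            for edges in combinations(sorted(g), k)
        }
        weights = {}
        for pattern in candidates:
            s_in, s_out = support(pattern, graphs), support(pattern, others)
            if s_out == 0.0:
                weights[pattern] = s_in  # contrast pattern
            elif s_in >= min_common:
                weights[pattern] = s_in * s_in / (s_in + s_out)  # common pattern
        model[label] = weights
    return model


def classify(model, graph):
    """Return the class whose matching patterns have the largest total weight."""
    scores = {
        label: sum(w for p, w in weights.items() if p <= graph)
        for label, weights in model.items()
    }
    return max(scores, key=scores.get)


train = {
    "sports": [doc_graph("team wins football match"),
               doc_graph("football match ends in draw")],
    "tech": [doc_graph("new phone released today"),
             doc_graph("phone released with new chip")],
}
model = mine_patterns(train)
print(classify(model, doc_graph("football match postponed")))
```

Because every pattern is an explicit set of edges with a weight, the model can be inspected directly, and an expert could add a pattern by hand by inserting an entry into `model[label]`. A general labeled-graph setting would need a real subgraph isomorphism test in place of the subset check.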