Two-Phase Web Site Classification Based on Hidden Markov Tree Models

  • Authors:
  • YongHong Tian;TieJun Huang;Wen Gao;Jun Cheng;PingBo Kang

  • Venue:
  • WI '03 Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence
  • Year:
  • 2003

Abstract

With the exponential growth of both the amount and diversity of the information that the web encompasses, automatic classification of topic-specific web sites is highly desirable. In this paper we propose a novel approach for web site classification based on the content, structure and context information of web sites. In our approach, the site structure is represented as a two-layered tree in which each page is modeled as a DOM (Document Object Model) tree and a site tree hierarchically links all pages within the site. Two context models are presented to capture the topic dependencies in the site. The Hidden Markov Tree (HMT) model is then used as the statistical model of both the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, to reduce the download size of web sites while maintaining high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. Building on these components, we employ a two-phase classification system that classifies web sites through a fine-to-coarse recursion. Experiments show that our approach offers high accuracy and efficient processing.
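
The abstract does not spell out the pruning rule, so the following is only a minimal sketch of what entropy-based pruning of a site tree could look like. Everything here is an assumption made for illustration: the `PageNode` class, the `topic_probs` field (a per-page topic distribution, which in the paper would presumably come from the HMT-based page classifier), the `max_entropy` threshold, and the choice to drop subtrees whose root page has a near-uniform (high-entropy) topic distribution. The paper's actual criterion may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageNode:
    """Hypothetical site-tree node: one page plus its child pages."""
    url: str
    topic_probs: Dict[str, float]  # assumed per-page topic distribution
    children: List["PageNode"] = field(default_factory=list)


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def prune(node: PageNode, max_entropy: float) -> PageNode:
    """Return a copy of the tree without high-entropy subtrees.

    Assumption: a child whose topic distribution has entropy above
    max_entropy is treated as uninformative, so it and its descendants
    are dropped (and would not need to be downloaded).
    """
    kept = [
        prune(child, max_entropy)
        for child in node.children
        if entropy(child.topic_probs) <= max_entropy
    ]
    return PageNode(node.url, node.topic_probs, kept)


# Illustrative data only; URLs and probabilities are made up.
site = PageNode(
    "example.org/",
    {"sports": 0.9, "news": 0.1},
    [
        PageNode("example.org/scores", {"sports": 1.0, "news": 0.0}),
        PageNode("example.org/misc", {"sports": 0.5, "news": 0.5}),
    ],
)
pruned = prune(site, max_entropy=0.5)
```

With this threshold, the `misc` page (entropy of 1 bit) is dropped while the `scores` page (entropy of 0 bits) is kept. In a crawler, the same test could be applied before following a link rather than after building the full tree, which is where the download savings would come from.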