Web Documents Categorization Using Fuzzy Representation and HAC

  • Authors:
  • Jiawei Deng;Lihui Chen

  • Venue:
  • WISE '00: Proceedings of the First International Conference on Web Information Systems Engineering (WISE'00), Volume 2
  • Year:
  • 2000

Abstract

Most existing techniques for characterizing Web documents are based on term-frequency analysis. In such models, each document in a given set is represented by a feature vector in a vector space. However, because Web documents written in HTML are semi-structured by means of tags, traditional techniques that assign term weights solely by frequency of occurrence may not represent the contents of such documents satisfactorily. Some recent studies have shown that the fuzzy representation (FR) of WWW information, based on the significance of HTML tags, is an effective alternative for characterizing Web documents. In this paper, FR is used to generate the feature vector for each Web document, and the Hierarchical Agglomerative Clustering (HAC) algorithm is then applied to these vectors to investigate the efficiency and effectiveness of this approach for automatically categorizing Web documents with similar contents. The experiments conducted suggest several benefits of using such an approach.
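
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag weights in TAG_WEIGHTS, the default weight, the use of the maximum tag weight as a term's membership degree, cosine similarity, and average linkage are all assumptions chosen for illustration, and the names fuzzy_vector and hac are hypothetical.

```python
import math
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical tag significance weights (assumed, not taken from the paper).
TAG_WEIGHTS = {"title": 1.0, "h1": 0.8, "h2": 0.7, "b": 0.5, "a": 0.4}
DEFAULT_WEIGHT = 0.2  # assumed weight for text under any other tag


class TagWeightedTerms(HTMLParser):
    """Collects, per term, the highest weight of any tag enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.weights = defaultdict(float)  # term -> membership degree

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray close tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        w = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            # Assumption: a term's degree is its best tag weight in the document.
            self.weights[term] = max(self.weights[term], w)


def fuzzy_vector(html):
    """Return a sparse term -> weight vector for one HTML document."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def hac(vectors, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the most similar pair of clusters.
        _, i, j = max((sim(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Calling fuzzy_vector on each page and passing the resulting list to hac with the desired number of categories returns lists of document indices, one list per cluster. This naive merge loop recomputes all pairwise similarities at every step, so it is only suitable for small collections.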