Unsupervised clustering for nontextual web document classification

  • Authors:
  • Samuel W. K. Chan;Mickey W. C. Chong

  • Affiliations:
  • Department of Decision Sciences and Managerial Economics, The Chinese University of Hong Kong, Hong Kong, China;Department of Decision Sciences and Managerial Economics, The Chinese University of Hong Kong, Hong Kong, China

  • Venue:
  • Decision Support Systems
  • Year:
  • 2004

Abstract

While the breadth of vocabulary used in long documents may mislead traditional keyword-based retrieval systems, the demand for techniques for nontextual Web classification and retrieval from large document collections is mounting. Only a few prototype systems have attempted to classify hypertext on the basis of nontextual elements in order to locate unfamiliar documents. As a result, a large portion of Web documents that are pictorial in nature remains far beyond the reach of most current search engines. In this research, we devise a novel quantitative model of nontextual World Wide Web (WWW) classification based on image information. An intelligent, content-sensitive, attribute-rich image classifier is presented. An image similarity measure is used to deduce the likelihood among images. Different image feature vectors have been constructed and evaluated. Evaluation shows that images judged to be similar by humans form interesting clusters in our unsupervised learning. Comparison with other clustering techniques, such as Hierarchical Agglomerative Clustering (HAC), demonstrates that our approach is useful in content-based image information retrieval.
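
The abstract does not specify the feature vectors, the similarity measure, or the clustering model, so the following is only an illustrative sketch of the HAC baseline it names, not the authors' method. It assumes a coarse gray-level histogram as the image feature vector, cosine similarity as the image similarity measure, and average linkage for merging; the function names `gray_histogram`, `cosine_similarity`, and `hac` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def gray_histogram(pixels, bins=4):
    """Assumed feature vector: normalized histogram of 0-255 intensities."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def cosine_similarity(u, v):
    """Assumed similarity measure between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hac(features, n_clusters):
    """Average-linkage agglomerative clustering over image indices."""
    clusters = [[i] for i in range(len(features))]

    def linkage(a, b):
        # Mean pairwise similarity between the members of two clusters.
        sims = [cosine_similarity(features[i], features[j]) for i in a for j in b]
        return sum(sims) / len(sims)

    while len(clusters) > n_clusters:
        # Merge the most similar pair; combinations() guarantees a < b.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy "images" as flat pixel lists: two dark, two bright.
images = [
    [10, 20, 30, 40],
    [5, 15, 25, 60],
    [200, 220, 240, 250],
    [210, 230, 245, 255],
]
features = [gray_histogram(img) for img in images]
print(hac(features, 2))  # [[0, 1], [2, 3]]
```

In this toy run the two dark images and the two bright images end up in separate clusters. A realistic pipeline would use richer features (color, texture, shape), and a pairwise-recomputed linkage like this one is only practical for small collections.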