Correlation-based Document Clustering using Web Logs

  • Authors:
  • Z. Su; Q. Yang; H. Zhang; X. Xu; Y. Hu

  • Venue:
  • HICSS '01: Proceedings of the 34th Annual Hawaii International Conference on System Sciences (HICSS-34), Volume 5
  • Year:
  • 2001

Abstract

A problem facing information retrieval on the web is how to cluster large numbers of web documents effectively. One approach is to cluster the documents using only the information in users' usage logs, not the content of the documents. In this paper, we present a recursive density-based clustering algorithm that adaptively changes its parameters. Our clustering algorithm, RDBC, is based on DBSCAN, a density-based algorithm that has proven able to process very large datasets. DBSCAN does not require the number of clusters to be determined in advance and is linear in time complexity, which makes it particularly attractive for web page clustering. It can be shown that RDBC has the same time complexity as DBSCAN. In addition, we show both analytically and experimentally that our method yields clustering results superior to those of DBSCAN.
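
For readers unfamiliar with the baseline, the following is a minimal sketch of plain DBSCAN, not of RDBC, whose recursive parameter adaptation the abstract does not specify. It assumes a precomputed symmetric distance matrix between documents (how such distances would be derived from usage logs is not stated here), and the names dbscan, dist, eps, and min_pts are illustrative choices. The brute-force neighbor search keeps the code short but makes this sketch quadratic in the number of documents.

```python
from collections import deque

NOISE = -1  # label for points that belong to no cluster


def dbscan(dist, eps, min_pts):
    """Cluster points given a symmetric distance matrix `dist`.

    Returns a list of cluster ids (0, 1, ...), with NOISE for outliers.
    `eps` is the neighborhood radius; `min_pts` is the minimum number of
    points (including the point itself) for a point to count as a core point.
    """
    n = len(dist)
    labels = [None] * n  # None means "not yet visited"

    def neighbors(i):
        # Brute-force scan; a spatial index would replace this in practice.
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


# Toy data: seven points on a line, with absolute difference as the distance.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
print(dbscan(dist, eps=1.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters is never passed in: it follows from eps and min_pts, which is the property the abstract highlights.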