Web document clustering using hyperlink structures

  • Authors:
  • Xiaofeng He; Hongyuan Zha; Chris H.Q. Ding; Horst D. Simon

  • Affiliations:
  • Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA and NERSC Division, Lawrence Berkeley National Laboratory, University of Californi ...; Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA; NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA; NERSC Division, Lawrence Berkeley National Laboratory, University of California, Berkeley, CA 94720, USA

  • Venue:
  • Computational Statistics & Data Analysis
  • Year:
  • 2002

Abstract

With the exponential growth of information on the World Wide Web, there is great demand for efficient methods of organizing the large amount of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods (K-means, multi-level METIS, and the recently developed normalized-cut method) using a new approach that combines textual information, hyperlink structure, and co-citation relations into a single similarity metric. We found that the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore some theoretical connections between the normalized-cut method and the K-means method.
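To make the idea in the abstract concrete, the following is a minimal sketch of how textual similarity, hyperlinks, and co-citation counts might be merged into one similarity matrix and then split with a normalized-cut style spectral bipartition. It is not the authors' formulation: the convex combination, the weights `w_text`, `w_link`, and `w_cocite`, the helper names, and the six-document toy data are all assumptions made for illustration. The bipartition step uses the standard spectral relaxation (sign of the second eigenvector of the normalized Laplacian).

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of a term-count matrix."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def combined_similarity(X, A, w_text=0.4, w_link=0.4, w_cocite=0.2):
    """Hypothetical blend of text, link, and co-citation similarity.

    X: documents x terms count matrix.
    A: adjacency matrix, A[i, j] = 1 if document i links to document j.
    The weights are illustrative assumptions, not values from the paper.
    """
    T = cosine_similarity(X)
    L = np.maximum(A, A.T)      # a link in either direction counts as a tie
    C = A.T @ A                 # C[i, j] = number of pages linking to both i and j
    np.fill_diagonal(C, 0.0)    # drop in-degree counts on the diagonal
    if C.max() > 0:
        C = C / C.max()         # scale co-citation counts into [0, 1]
    W = w_text * T + w_link * L + w_cocite * C
    np.fill_diagonal(W, 0.0)    # no self-similarity in the graph
    return W

def normalized_cut_bipartition(W):
    """Two-way split from the second eigenvector of the normalized Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)   # eigenvalues come back in ascending order
    z = d_inv_sqrt * vecs[:, 1]       # relaxed cluster indicator
    labels = (z > 0).astype(int)
    if labels[0] == 1:                # fix the arbitrary sign: document 0 gets label 0
        labels = 1 - labels
    return labels

# Toy data (invented): documents 0-2 share vocabulary and links, as do 3-5.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [2, 2, 1, 0],
              [0, 0, 2, 1],
              [0, 0, 1, 2],
              [0, 0, 2, 2]], dtype=float)

A = np.zeros((6, 6))
for src, dst in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[src, dst] = 1.0           # (2, 3) is the single cross-group link

W = combined_similarity(X, A)
labels = normalized_cut_bipartition(W)
print(labels)
```

The same matrix `W` could also be handed to other clustering routines, which is one way to compare methods under a shared similarity metric as the abstract describes.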