A statistical approach to URL-based web page clustering

  • Authors:
  • Inma Hernández;Carlos R. Rivero;David Ruiz;Rafael Corchuelo

  • Affiliations:
  • University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain;University of Seville, Seville, Spain

  • Venue:
  • Proceedings of the 21st international conference companion on World Wide Web
  • Year:
  • 2012

Abstract

Most web page classifiers use features from the page content, which means that a page has to be downloaded before it can be classified. We propose a technique to cluster web pages by means of their URLs exclusively. In contrast to other proposals, we analyze features that are outside the page; hence, we do not need to download a page to classify it. The technique is also unsupervised, requiring little intervention from the user. Furthermore, we do not need to crawl a site extensively to build a classifier for that site, only a small subset of its pages. We have performed an experiment over 21 highly visited websites to evaluate the performance of our classifier, obtaining good precision and recall results.
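
To make the idea of URL-only clustering concrete, here is a minimal Python sketch. It is not the statistical technique the authors propose, which the abstract does not detail; it swaps in a simple heuristic that masks path segments containing digits and groups URLs by the resulting pattern. The function names `url_signature` and `cluster_urls`, the masking rule, and the `example.com` URLs are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit


def url_signature(url: str) -> str:
    """Reduce a URL to a coarse pattern without downloading the page.

    Assumption: a path segment containing a digit is treated as a
    variable part (an identifier) and masked with "*".
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    masked = ["*" if re.search(r"\d", s) else s.lower() for s in segments]
    # Keep only query parameter names, sorted so their order is irrelevant.
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    signature = parts.netloc.lower() + "/" + "/".join(masked)
    if keys:
        signature += "?" + "&".join(keys)
    return signature


def cluster_urls(urls):
    """Group URLs that share the same signature."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://example.com/product/123",
        "http://example.com/product/456",
        "http://example.com/author/view?id=7",
        "http://example.com/author/view?id=9",
        "http://example.com/about",
    ]
    for signature, members in cluster_urls(sample).items():
        print(signature, len(members))
```

Like the approach described in the abstract, this sketch needs only the URLs themselves, so no page is fetched. Unlike that approach, its masking rule is fixed by hand rather than derived statistically from a sample of the site's pages.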