Highly efficient algorithms for structural clustering of large websites

  • Authors: Lorenzo Blanco; Nilesh Dalvi; Ashwin Machanavajjhala

  • Affiliations: Università degli Studi Roma Tre, Rome, Italy; Yahoo! Research, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA

  • Venue: Proceedings of the 20th International Conference on World Wide Web
  • Year: 2011

Abstract

In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that, using only the URLs of the webpages and simple content features, it is possible to cluster webpages effectively and efficiently. At the heart of our techniques is a principled framework, grounded in information theory, that lets us leverage the URLs effectively and combine them with content and structural properties. Through an extensive evaluation over several large, complete websites, we demonstrate the effectiveness of our techniques at a scale unattainable by previous techniques.
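
To make the idea of clustering pages from their URLs alone more concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes a fixed, hand-written rule (purely numeric path segments are treated as interchangeable identifiers) in place of the information-theoretic framework the abstract describes, and it ignores content features entirely. The function names `url_pattern` and `cluster_by_url` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Hypothetical helper: reduce a URL to a coarse structural pattern.

    Assumption: a path segment made only of digits is an record identifier,
    so it is replaced by the wildcard "*". This fixed rule stands in for the
    principled, information-theoretic choice described in the abstract.
    """
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    generalized = ["*" if s.isdigit() else s for s in segments]
    return parts.netloc + "/" + "/".join(generalized)


def cluster_by_url(urls):
    """Group URLs that share the same structural pattern."""
    clusters = defaultdict(list)
    for u in urls:
        clusters[url_pattern(u)].append(u)
    return dict(clusters)


if __name__ == "__main__":
    example_urls = [
        "http://example.com/movie/123",
        "http://example.com/movie/456",
        "http://example.com/actor/7",
    ]
    for pattern, members in cluster_by_url(example_urls).items():
        print(pattern, len(members))
```

Each URL is processed once, so the grouping runs in time linear in the number of pages, which hints at why URL-based signals are attractive at the scale of full websites. A fixed rule like this one will misgroup sites whose URLs do not follow the assumed convention, which is the kind of case a principled framework is meant to handle.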