Page segmentation by web content clustering

  • Authors: Sadet Alcic; Stefan Conrad

  • Affiliations: Heinrich-Heine-University of Duesseldorf, Duesseldorf, Germany (both authors)

  • Venue: Proceedings of the International Conference on Web Intelligence, Mining and Semantics
  • Year: 2011

Abstract

Web page segmentation is an important task that benefits a variety of applications, ranging from data extraction to accessibility improvement. Focusing on the smallest content units of a web page, page segmentation can be reduced to clustering web contents into structurally and semantically cohesive groups. To investigate the web page segmentation task from the clustering point of view, we define three possible distance measures for content units, based on their DOM, geometric, and semantic properties. We combine these distance measures with common clustering techniques and evaluate the web page segmentation accuracy on a labelled collection by applying widely used validity measures.
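
The abstract does not give the definitions of the three distance measures, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical `ContentUnit` record and stand-in distances: tree-edge count through the deepest common ancestor for the DOM distance, the gap between bounding boxes for the geometric distance, and one minus the Jaccard word overlap for the semantic distance. The clustering step is plain single-link clustering cut at a threshold, chosen here as one example of a common technique.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ContentUnit:
    """Hypothetical content unit; the field names are assumptions for this sketch."""
    dom_path: tuple  # node labels from the document root down to the unit
    box: tuple       # rendered bounding box as (x, y, width, height)
    text: str        # visible text of the unit


def dom_distance(a, b):
    """Assumed DOM distance: tree edges between the units via their deepest common ancestor."""
    common = 0
    for label_a, label_b in zip(a.dom_path, b.dom_path):
        if label_a != label_b:
            break
        common += 1
    return len(a.dom_path) + len(b.dom_path) - 2 * common


def geometric_distance(a, b):
    """Assumed geometric distance: Euclidean gap between boxes (0 if they touch or overlap)."""
    ax, ay, aw, ah = a.box
    bx, by, bw, bh = b.box
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return (dx * dx + dy * dy) ** 0.5


def semantic_distance(a, b):
    """Assumed semantic distance: 1 minus the Jaccard similarity of the word sets."""
    words_a = set(a.text.lower().split())
    words_b = set(b.text.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(union)


def single_link_clusters(units, distance, threshold):
    """Group unit indices so that any pair within `threshold` ends up in the same cluster."""
    parent = list(range(len(units)))

    def find(i):
        # Follow parent links to the cluster representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(units)), 2):
        if distance(units[i], units[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(units)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy page: a headline with its paragraph, and two navigation links.
units = [
    ContentUnit(("html", "body", "div[1]", "h1"), (0, 0, 600, 40), "Site news"),
    ContentUnit(("html", "body", "div[1]", "p"), (0, 45, 600, 100), "Latest news from the site"),
    ContentUnit(("html", "body", "div[2]", "a[1]"), (0, 400, 100, 20), "Home"),
    ContentUnit(("html", "body", "div[2]", "a[2]"), (110, 400, 100, 20), "Contact"),
]

print(single_link_clusters(units, dom_distance, threshold=2))  # [[0, 1], [2, 3]]
```

On this toy page the DOM and geometric distances both recover the two intended segments at suitable thresholds, while the word-overlap distance only links the headline to its paragraph. The thresholds are arbitrary illustration values; evaluating the resulting clusters against labelled segments with a validity measure is left out of the sketch.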