A web page topic segmentation algorithm based on visual criteria and content layout

  • Authors:
  • Idir Chibane; Bich-Lien Doan

  • Affiliations:
  • SUPELEC, Gif sur Yvette, France; SUPELEC, Gif sur Yvette, France

  • Venue:
  • SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
  • Year:
  • 2007

Abstract

This paper presents experiments with a web page topic segmentation algorithm that show a significant precision improvement in retrieving documents from the Web track corpus of TREC 2001. Instead of processing the whole document, the algorithm segments a web page into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings <H1>~<H6> and paragraphs <P>). We conclude that combining visual and content layout criteria gives the best precision: the ranking of a page is computed from the relevant segments produced by the segmentation algorithm.
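
The idea of ranking a page by its segments rather than by the whole document can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the boundary tag set, the term-overlap score, and the names `BlockSegmenter`, `segment`, `block_score`, and `page_score` are all assumptions made for illustration, and purely visual cues such as color are not modeled here.

```python
from html.parser import HTMLParser

# Tags treated as segment boundaries (an assumption for this sketch):
# headings and paragraphs as structural cues, <hr> as a visual cue.
BOUNDARY_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "hr"}


class BlockSegmenter(HTMLParser):
    """Collects page text into blocks, starting a new block at each boundary tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def _flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BOUNDARY_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # emit the last open block


def segment(html):
    """Split an HTML string into a list of text blocks."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def block_score(block, query_terms):
    """Fraction of words in the block that match a query term (hypothetical score)."""
    words = block.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,;:!?") in query_terms)
    return hits / len(words)


def page_score(html, query):
    """Score a page by its best-matching block instead of the whole document."""
    terms = set(query.lower().split())
    return max((block_score(b, terms) for b in segment(html)), default=0.0)
```

Under these assumptions, a page whose heading matches the query scores highly even when unrelated blocks (navigation, contact details) would dilute a whole-document score.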