Indexing and querying segmented web pages: the BlockWeb Model

  • Authors:
  • Emmanuel Bruno;Nicolas Faessel;Hervé Glotin;Jacques Le Maitre;Michel Scholl

  • Affiliations:
  • LSIS, Université du Sud Toulon-Var, La Garde Cedex, France 83957;LSIS, Université Paul Cézanne, Marseille Cedex 20, France 13397;LSIS, Université du Sud Toulon-Var, La Garde Cedex, France 83957;LSIS, Université du Sud Toulon-Var, La Garde Cedex, France 83957;Cedric/Wisdom, CNAM, Paris Cedex 03, France 75141

  • Venue:
  • World Wide Web
  • Year:
  • 2011

Abstract

We present a model for indexing and querying web pages, based on the hierarchical decomposition of pages into blocks. Splitting a page into blocks has several advantages for page design, indexing and querying: (i) the blocks of a page most similar to a query can be returned instead of the page as a whole; (ii) the importance of a block can be taken into account; and (iii) so can the permeability of a block to its neighbor blocks: a block b is said to be permeable to a block b′ in the same page if the content of b′ (text, image, etc.) can be (partially) inherited by b upon indexing. We describe an engine implementing this model, including the transformation of web pages into block hierarchies, the definition of a dedicated language to express indexing rules, and the storage of indexed blocks in an XML repository. The model is assessed on a dataset of electronic news and on a dataset drawn from web pages of the ImagEval campaign, where it improves the mean average precision of the baseline by 16%.
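
To make the notions of importance and permeability concrete, the following is a minimal illustrative sketch, not the BlockWeb engine itself. It assumes that a block's index is a bag of term weights, that permeability is a coefficient between 0 and 1 applied to a neighbor's term counts, and that importance is a multiplicative factor. The abstract does not give the actual formula, so these choices, and the names `Block` and `index_block`, are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A page block (hypothetical structure, not the paper's data model)."""
    name: str
    text: str = ""
    importance: float = 1.0  # assumed multiplicative weight of the block


def local_terms(block):
    """Count the terms appearing in the block's own text."""
    return Counter(block.text.lower().split())


def index_block(block, neighbors):
    """Build term weights for `block`.

    `neighbors` is a list of (other_block, permeability) pairs, where
    permeability is the assumed fraction of the other block's content
    that `block` inherits at indexing time.
    """
    weights = {}
    # The block's own content counts in full.
    for term, count in local_terms(block).items():
        weights[term] = weights.get(term, 0.0) + float(count)
    # Inherited content is scaled by the permeability coefficient.
    for other, permeability in neighbors:
        for term, count in local_terms(other).items():
            weights[term] = weights.get(term, 0.0) + permeability * count
    # Importance scales the whole index of the block (an assumption).
    return {term: block.importance * w for term, w in weights.items()}


# An image block with no text of its own inherits terms from its caption.
caption = Block("caption", "sunset over harbour")
image = Block("image", "", importance=2.0)
image_index = index_block(image, [(caption, 0.5)])
```

Under these assumptions, an image block with no text of its own becomes retrievable through the terms of a neighboring caption, which is the kind of effect the permeability notion is meant to capture.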