Granular modeling of web documents: impact on information retrieval systems

  • Authors: Elisabetta Fersini, Enza Messina, Francesco Archetti

  • Affiliations: University of Milano-Bicocca, Milano, Italy (all authors)

  • Venue: Proceedings of the 10th ACM workshop on Web information and data management
  • Year: 2008

Abstract

One of the most important tasks in Information Retrieval (IR) is extracting and processing the information contained in web pages. A common approach is to treat a web page as an atomic unit and to model its textual content as a "bag-of-words". However, this kind of representation does not reflect how people perceive a web page. A granular document representation, in terms of semantic objects, can help identify the semantic areas of a web page and use them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating the importance of semantic objects and to enhance the performance of IR systems. In particular, we show that this new metric can be used not only for classification, in which instances are assumed to be independent and identically distributed, but also to gauge the strength of the relationships between hypertextual documents and to exploit this information to improve page ranking performance.
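
To make the contrast between the two representations concrete, the following is a minimal Python sketch, not the authors' method. It assumes a page has already been segmented into named blocks; the block names, the sample text, and the helper functions `tokenize`, `bag_of_words`, and `granular_view` are all hypothetical. The importance metric proposed in the paper is not described in the abstract and is not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(blocks):
    """Atomic view: pool the text of every block into one term-count vector."""
    counts = Counter()
    for text in blocks.values():
        counts.update(tokenize(text))
    return counts


def granular_view(blocks):
    """Granular view: keep a separate term-count vector for each block."""
    return {name: Counter(tokenize(text)) for name, text in blocks.items()}


# Hypothetical page, already split into blocks (the segmentation step is assumed).
page = {
    "navigation": "home products contact login",
    "main_article": "granular modeling of web documents improves retrieval of web documents",
    "footer": "contact privacy terms",
}

atomic = bag_of_words(page)
granular = granular_view(page)

# The pooled vector cannot tell where a term came from; the per-block vectors can.
print(atomic["contact"])
print({name: counts["contact"] for name, counts in granular.items()})
```

In the pooled vector, a term such as "contact" that appears only in the navigation and footer blocks is indistinguishable from a term in the main article. Keeping the blocks separate is what would allow a block-level importance score to weight them differently.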