Using neighbors to date web documents

  • Authors: Sérgio Nunes; Cristina Ribeiro; Gabriel David

  • Affiliations: Faculdade de Engenharia da Universidade do Porto, Porto, Portugal; Faculdade de Engenharia da Universidade do Porto/INESC-Porto, Porto, Portugal; Faculdade de Engenharia da Universidade do Porto/INESC-Porto, Porto, Portugal

  • Venue: Proceedings of the 9th annual ACM international workshop on Web information and data management
  • Year: 2007

Abstract

Time has been used successfully as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary step. Classic approaches rely on HTTP header fields to estimate a document's last update time. The main problem with this approach is that it applies to only a small fraction of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample of 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. Considering documents in isolation, we are able to date 52% of them. Including each document's neighborhood, we are able to estimate a date for more than 86% of the same sample. We also find that estimates differ significantly depending on the type of neighbor used. The most reliable estimates are based on the document's media assets, while the worst are based on incoming links. These results are experimentally evaluated with a real-world application using different datasets.
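
The idea of falling back from a document's own HTTP headers to those of its neighbors can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how neighbor dates are combined, so the median and the strict priority order (media assets, then outgoing links, then incoming links) are assumptions loosely motivated by the reliability ranking reported above. The names `last_modified`, `estimate_date`, and the neighbor keys `media`, `outlinks`, and `inlinks` are hypothetical, and headers are assumed to be already fetched and passed in as plain dictionaries keyed by `Last-Modified`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from statistics import median
from typing import Mapping, Optional, Sequence

Headers = Mapping[str, str]

# Assumed priority order: most reliable neighbor type first.
NEIGHBOR_PRIORITY = ("media", "outlinks", "inlinks")


def last_modified(headers: Headers) -> Optional[datetime]:
    """Return the Last-Modified date as an aware datetime, or None."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Malformed header values are treated as missing.
        return None
    if parsed.tzinfo is None:
        # Treat dates without a usable zone as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def estimate_date(
    doc_headers: Headers,
    neighbors: Mapping[str, Sequence[Headers]],
) -> Optional[datetime]:
    """Date a document from its own headers, else from its neighbors."""
    own = last_modified(doc_headers)
    if own is not None:
        return own
    for kind in NEIGHBOR_PRIORITY:
        dates = [
            d
            for h in neighbors.get(kind, ())
            if (d := last_modified(h)) is not None
        ]
        if dates:
            # Median of the neighbor dates (assumed aggregation rule).
            ts = median(d.timestamp() for d in dates)
            return datetime.fromtimestamp(ts, tz=timezone.utc)
    return None
```

Calling `estimate_date({}, {"media": [...]})` with a document that has no `Last-Modified` header returns a date derived from its media assets when any of them carries the header, and `None` when no neighbor can be dated either.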