Using neighbors to date web documents

Authors:
Sérgio Nunes; Cristina Ribeiro; Gabriel David
Affiliations:
Faculdade de Engenharia da Universidade do Porto, Porto, Portugal; Faculdade de Engenharia da Universidade do Porto/INESC-Porto, Porto, Portugal; Faculdade de Engenharia da Universidade do Porto/INESC-Porto, Porto, Portugal
Venue:
Proceedings of the 9th annual ACM international workshop on Web information and data management
Year:
2007

Abstract

Time has been successfully used as a feature in web information retrieval tasks. In this context, estimating a document's inception date or last update date is a necessary task. Classic approaches have used HTTP header fields to estimate a document's last update time. The main problem with this approach is that it applies to only a small fraction of web documents. In this work, we evaluate an alternative strategy based on a document's neighborhood. Using a random sample of 10,000 URLs from the Yahoo! Directory, we study each document's links and media assets to determine its age. Considering documents in isolation, we are able to date 52% of them. Including each document's neighborhood, we are able to estimate the date of more than 86% of the same sample. We also find that estimates differ significantly according to the type of neighbors used: the most reliable estimates are based on the document's media assets, while the worst are based on incoming links. These results are experimentally evaluated with a real-world application using different datasets.
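
The neighbor-based idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how neighbor dates are combined, so the median used here is an assumption, as is the fallback order (media assets, then outgoing links, then incoming links), which is only motivated by the reliability ranking reported above. The function names and the `media`, `out_links`, and `in_links` keys are hypothetical, and the headers are assumed to be already fetched and stored under the exact key `Last-Modified`.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from statistics import median
from typing import Mapping, Optional, Sequence, Tuple

Headers = Mapping[str, str]


def last_modified(headers: Headers) -> Optional[datetime]:
    """Parse the Last-Modified header; return None if absent or malformed."""
    value = headers.get("Last-Modified")
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    # Treat naive timestamps as UTC so that all values are comparable.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def estimate_date(
    doc_headers: Headers,
    neighbors: Mapping[str, Sequence[Headers]],
) -> Tuple[Optional[datetime], str]:
    """Return (estimated date, source of the estimate).

    The document's own header is used when present. Otherwise the
    neighbor groups are tried in an assumed order of reliability, and
    the median of the dates in the first non-empty group is returned.
    """
    own = last_modified(doc_headers)
    if own is not None:
        return own, "document"
    for kind in ("media", "out_links", "in_links"):  # assumed priority
        dates = [
            d for d in map(last_modified, neighbors.get(kind, []))
            if d is not None
        ]
        if dates:
            ts = median(d.timestamp() for d in dates)
            return datetime.fromtimestamp(ts, tz=timezone.utc), kind
    return None, "unknown"
```

In this sketch, a document without a usable header but with two dated media assets would be assigned the midpoint of their dates, and the returned source label makes it possible to weight the estimate by neighbor type later.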