Web spam filtering in internet archives

  • Authors:
  • Miklós Erdélyi;András A. Benczúr;Julien Masanés;Dávid Siklósi

  • Affiliations:
  • University of Pannonia and Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;European Archive Foundation, France;Computer and Automation Research Institute of the Hungarian Academy of Sciences

  • Venue:
  • Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
  • Year:
  • 2009

Abstract

While Web spam targets the high commercial value of top-ranked search-engine results, Web archives suffer quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, but their use is planned for the future, as indicated by a survey with responses from more than 20 institutions worldwide. These archives typically operate on a modest budget that prohibits running standalone Web spam filtering, but collaborative efforts could lead to a high-quality solution for them. In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and we show the difficulty of migrating filter models across different crawls using the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007.