Web spam challenge proposal for filtering in archives

  • Authors:
  • András A. Benczúr; Miklós Erdélyi; Julien Masanés; Dávid Siklósi

  • Affiliations:
  • Computer and Automation Research Institute of the Hungarian Academy of Sciences; University of Pannonia and Computer and Automation Research Institute of the Hungarian Academy of Sciences; European Archive Foundation, France; Computer and Automation Research Institute of the Hungarian Academy of Sciences

  • Venue:
  • Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
  • Year:
  • 2009

Abstract

In this paper we propose new tasks for a possible future Web Spam Challenge, motivated by the needs of the archival community. The Web archival community consists of several relatively small institutions that operate independently, possibly over different top-level domains (TLDs). Each of them may hold a large set of historic crawls. Efficient filtering would therefore require (1) enhanced use of the time series of domain snapshots and (2) collaboration by transferring models across different TLDs. Corresponding Challenge tasks could include the distribution of crawl snapshot data for feature generation as well as the classification of unlabeled new crawls of the same or even different TLDs.