Web spam filtering in internet archives
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
In this paper we propose new tasks for a possible future Web Spam Challenge, motivated by the needs of the Web archival community. This community consists of several relatively small institutions that operate independently, often over different top-level domains (TLDs), and each may hold a large set of historic crawls. Efficient spam filtering would therefore require (1) exploiting the time series of domain snapshots and (2) collaboration by transferring models across different TLDs. Corresponding Challenge tasks could include distributing crawl snapshot data for feature generation, as well as classifying unlabeled new crawls of the same or even different TLDs.
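The cross-TLD transfer setting described above can be sketched as training a classifier on labeled hosts of one TLD and applying it to an unlabeled crawl of another. The sketch below is purely illustrative: the per-snapshot page counts, the growth-based features, and the nearest-centroid model are assumptions for the example, not the actual Challenge pipeline or any feature set from the paper.

```python
# Illustrative sketch of cross-TLD model transfer for web spam filtering.
# Data, feature choices, and the nearest-centroid classifier are all
# hypothetical; real Challenge tasks would use richer crawl features.
from statistics import mean

def snapshot_features(snapshots):
    """Turn a time series of per-crawl page counts for one host into
    two simple features: average size and average growth per snapshot."""
    growth = [b - a for a, b in zip(snapshots, snapshots[1:])]
    return (mean(snapshots), mean(growth) if growth else 0.0)

def train_centroids(hosts):
    """hosts: list of (snapshots, label) pairs, label in {'spam', 'ham'}.
    Returns one feature centroid per class."""
    by_label = {"spam": [], "ham": []}
    for snaps, label in hosts:
        by_label[label].append(snapshot_features(snaps))
    return {
        label: tuple(mean(dim) for dim in zip(*feats))
        for label, feats in by_label.items()
    }

def classify(centroids, snapshots):
    """Assign the class whose centroid is nearest in feature space."""
    f = snapshot_features(snapshots)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(f, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Labeled historic crawls from a source TLD (hypothetical numbers):
# spam hosts balloon between snapshots, ham hosts stay stable.
source_tld = [
    ([100, 400, 1600], "spam"),
    ([120, 500, 2000], "spam"),
    ([300, 310, 305], "ham"),
    ([80, 85, 90], "ham"),
]
model = train_centroids(source_tld)

# Transfer: apply the model to unlabeled hosts crawled under another TLD.
print(classify(model, [200, 900, 3500]))  # -> spam (rapid growth)
print(classify(model, [150, 155, 160]))   # -> ham (stable size)
```

A real transfer task would additionally have to handle feature distribution shift between TLDs, which is exactly what makes it an interesting Challenge setting.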