Web spam filtering in internet archives

Authors:
Miklós Erdélyi;András A. Benczúr;Julien Masanés;Dávid Siklósi
Affiliations:
University of Pannonia and Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;European Archive Foundation, France;Computer and Automation Research Institute of the Hungarian Academy of Sciences
Venue:
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Year:
2009

Citing 22

Synchronizing a database to improve freshness. SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
The Evolution of the Web and Implications for an Incremental Crawler. VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Challenges in web search engines. ACM SIGIR Forum
Ranking the web frontier. Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay. Proceedings of the 13th international conference on World Wide Web
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
UbiCrawler: a scalable fully distributed web crawler. Software: Practice & Experience
Identifying link farm spam pages. WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Detecting phrase-level duplication on the world wide web. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Link spam alliances. VLDB '05 Proceedings of the 31st international conference on Very large data bases
Topical TrustRank: using topicality to combat web spam. Proceedings of the 15th international conference on World Wide Web
Detecting spam web pages through content analysis. Proceedings of the 15th international conference on World Wide Web
Link spam detection based on mass estimation. VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A reference collection for web spam. ACM SIGIR Forum
Web Archiving
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Know your neighbors: web spam detection using the web topology. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Combating web spam with trustrank. VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
A large time-aware web graph. ACM SIGIR Forum
Temporal Evolution of the UK Web. ICDMW '08 Proceedings of the 2008 IEEE International Conference on Data Mining Workshops
Web spam challenge proposal for filtering in archives. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Thwarting the nigritude ultramarine: learning to identify link spam. ECML'05 Proceedings of the 16th European conference on Machine Learning

Cited 3

Web spam challenge proposal for filtering in archives. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam classification: a few features worth more. Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search. Foundations and Trends in Information Retrieval

Abstract

Web spam targets the high commercial value of top-ranked search-engine results, and Web archives suffer quality deterioration and wasted resources as a side effect. Web archivists rarely use spam filtering technologies today, although a survey with responses from more than 20 institutions worldwide indicates that they plan to do so in the future. These archives typically operate on modest budgets that rule out running standalone Web spam filtering, but collaborative efforts could give them a high-quality solution. In this paper we illustrate the spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and we examine the difficulty of migrating filter models across crawls using the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007.