Nullification test collections for web spam and SEO

Authors:
Timothy Jones; Ramesh Sankaranarayana; David Hawking; Nick Craswell
Affiliations:
The Australian National University, Canberra, Australia; The Australian National University, Canberra, Australia; Funnelback Pty Ltd, Canberra, Australia; Microsoft Research, Cambridge, UK
Venue:
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Year:
2009

Abstract

Research in adversarial information retrieval (AIR) has been facilitated by the availability of the UK-2006 and UK-2007 collections, which comprise crawl data, a link graph, and spam labels. However, research into nullifying the negative effect of spam or excessive search engine optimisation (SEO) on the ranking of non-spam pages is not well supported by these resources, and neither is the study of cloaking techniques or of click spam. Moreover, the domain-restricted nature of a .uk crawl means that only parts of link-farm icebergs may be visible in these crawls. We introduce the term nullification, which we define as "preventing problem pages from negatively affecting search results". We show some important differences between the properties of current .uk-restricted crawls and those previously reported for the Web as a whole. We identify a need for an adversarial IR collection that is not domain-restricted and that is supported by appropriate query sets and (optimistically) user-behaviour data. We assess the billion-page unrestricted crawl being conducted by CMU (web09-bst), which will be used in the 2009 TREC Web Track, as a possible basis for a new AIR test collection. We discuss the pros and cons of its scale, and the feasibility of adding resources such as query lists to enhance the utility of the collection for AIR research.