Shame to be sham: addressing content-based grey hat search engine optimization

Authors:
Fiana Raiber;Kevyn Collins-Thompson;Oren Kurland
Affiliations:
Technion - Israel Institute of Technology, Haifa, Israel;Microsoft Research, Redmond, USA;Technion - Israel Institute of Technology, Haifa, Israel
Venue:
Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Year:
2013

Abstract

We present an initial study identifying a form of content-based grey hat search engine optimization in which a Web page contains both potentially relevant content and manipulated content. We call such pages sham documents because they lie in the grey area between 'ham' (clearly normal) and 'spam' (clearly fake). Sham documents are often ranked artificially high in response to certain queries, but they may also contain some useful information and so cannot be considered outright spam. We report a novel annotation effort performed on the ClueWeb09 benchmark, in which pages were labeled as spam, sham, or legitimate content. Significant inter-annotator agreement rates support the claim that there are sham documents that are ranked highly by a very effective retrieval approach yet are not spam. We also present an initial study of predictors that may indicate whether a query is the target of shamming.
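
The abstract does not say which agreement statistic was used for the three-way labeling. As an illustration only, the following minimal sketch assumes two annotators and Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The function name `cohen_kappa` and the label lists are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items (illustrative sketch)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations (made up for illustration, not from ClueWeb09).
annotator_1 = ["spam", "sham", "sham", "legitimate", "legitimate", "sham"]
annotator_2 = ["spam", "sham", "legitimate", "legitimate", "legitimate", "sham"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

With more than two annotators per page, a multi-rater statistic such as Fleiss' kappa would be the usual substitute; which one applies here depends on details the abstract does not give.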