We present an initial study identifying a form of content-based grey hat search engine optimization, in which a Web page contains both potentially relevant content and manipulated content. We call such pages sham documents, because they lie in the grey area between 'ham' (clearly normal) and 'spam' (clearly fake). Sham documents are often ranked artificially high in response to certain queries, but they may also contain some useful information and so cannot be considered outright spam. We report a novel annotation effort performed with the ClueWeb09 benchmark, in which pages were labeled as spam, sham, or legitimate content. Significant inter-annotator agreement rates support the claim that there are sham documents that are highly ranked by a very effective retrieval approach, yet are not spam. We also present an initial study of predictors that may indicate whether a query is the target of shamming.