Efficient and effective spam filtering and re-ranking for large web datasets

  • Authors:
  • Gordon V. Cormack; Mark D. Smucker; Charles L. Clarke

  • Affiliations:
  • University of Waterloo, Waterloo, Canada N2L 3G1 (all authors)

  • Venue:
  • Information Retrieval
  • Year:
  • 2011

Abstract

The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains one billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of these tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the "spamminess" of every page in the dataset on a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, and MAP) for nearly all submitted runs. Moreover, by using a set of "honeypot" queries, the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering, rising from among the worst to among the best.
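To make the abstract's "simple content-based classifier" concrete, the sketch below shows one plausible shape for such a system: an online logistic-regression model over hashed byte 4-gram features, plus a fixed-threshold filter applied to a submitted run. This is a minimal illustration, not the authors' implementation; the feature-space size, learning rate, clamping bounds, and thresholding scheme are all assumptions made for the example.

```python
# A minimal sketch, NOT the authors' implementation: an online
# logistic-regression spam classifier over hashed byte 4-gram features.
# Feature-space size, learning rate, and the filtering scheme are assumptions.
import math

NUM_FEATURES = 1 << 20    # hashed feature space size (assumption)
LEARNING_RATE = 0.002     # fixed step size (assumption)
weights = [0.0] * NUM_FEATURES

def features(page: bytes):
    """Yield hashed overlapping byte 4-grams of a page's raw content."""
    for i in range(len(page) - 3):
        yield hash(page[i:i + 4]) % NUM_FEATURES

def spamminess(page: bytes) -> float:
    """Log-odds score: higher means more spam-like."""
    return sum(weights[f] for f in features(page))

def train(page: bytes, is_spam: bool) -> None:
    """One online gradient step on the logistic loss for a labeled page."""
    z = max(-30.0, min(30.0, spamminess(page)))   # clamp to avoid overflow
    p = 1.0 / (1.0 + math.exp(-z))                # predicted spam probability
    gradient = (1.0 if is_spam else 0.0) - p
    for f in features(page):
        weights[f] += LEARNING_RATE * gradient

def filter_run(ranked_doc_ids, score_of, threshold):
    """Re-rank a submitted run by dropping documents whose spamminess
    exceeds a fixed cutoff (one simple filtering scheme)."""
    return [d for d in ranked_doc_ids if score_of[d] <= threshold]
```

In this sketch, training labels could come from an existing spam corpus or, as the abstract suggests, be generated automatically by treating pages retrieved for "honeypot" queries as labeled examples. A single linear pass of this kind over each page is what makes scoring an entire billion-page collection on one machine plausible.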