The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of these two tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the "spamminess" of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of "honeypot" queries, the labeling of training data may be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering, moving from among the worst to among the best.
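
As a rough illustration of the pipeline described above (score every page for "spamminess" with a content-based classifier, then filter a ranked run), the following is a minimal sketch. The abstract does not specify the classifier's features or learning rule, so the hashed character 4-grams, the online logistic regression update, and the names `ngram_features`, `train`, `spam_score`, and `filter_ranking` are all assumptions made for illustration, as are the constants and the score threshold.

```python
import math
import zlib
from typing import Dict, Iterable, List, Set, Tuple

# All constants below are hypothetical choices for this sketch.
NUM_BUCKETS = 1 << 18   # size of the hashed feature space
LEARNING_RATE = 0.05    # step size for the online update


def ngram_features(text: str, n: int = 4) -> Set[int]:
    """Hash each overlapping byte n-gram of the page into a bucket index."""
    data = text.lower().encode("utf-8")
    return {
        zlib.crc32(data[i:i + n]) % NUM_BUCKETS
        for i in range(max(len(data) - n + 1, 0))
    }


def spam_score(weights: Dict[int, float], text: str) -> float:
    """Return a log-odds style score; higher means more spam-like."""
    return sum(weights.get(f, 0.0) for f in ngram_features(text))


def train(examples: Iterable[Tuple[str, bool]], epochs: int = 20) -> Dict[int, float]:
    """Fit weights with online logistic regression (assumed learner)."""
    examples = list(examples)
    weights: Dict[int, float] = {}
    for _ in range(epochs):
        for text, is_spam in examples:
            feats = ngram_features(text)
            z = sum(weights.get(f, 0.0) for f in feats)
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            gradient = (1.0 if is_spam else 0.0) - p
            for f in feats:
                weights[f] = weights.get(f, 0.0) + LEARNING_RATE * gradient
    return weights


def filter_ranking(
    ranked_doc_ids: List[str],
    scores: Dict[str, float],
    max_score: float = 0.0,
) -> List[str]:
    """Drop documents scored above max_score, preserving the original order."""
    return [doc_id for doc_id in ranked_doc_ids if scores[doc_id] <= max_score]
```

In use, one would train on a small labeled sample, score each page once, and then apply `filter_ranking` to every submitted run. A percentile cutoff over the scores of the whole collection would be a natural alternative to the fixed threshold used here.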