Proceedings of the 18th Australasian Document Computing Symposium
Spam has long been recognized as a problem that web search engines must deal with. Large collections are also an increasing problem for institutions that lack the resources to process them in their entirety. In this paper we investigate the effect that withholding documents identified as spam has on the resources required to process large collections. We also investigate the resulting search effectiveness and efficiency when different amounts of spam are withheld. We find that removing spam at indexing time decreases the index size without affecting indexing throughput, and improves search precision for some thresholds.
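
The idea of withholding spam at indexing time can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `withhold_spam`, the score scale (0 to 100, lower meaning more likely spam), the threshold values, and the choice to keep unscored documents are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

Doc = Tuple[str, str]  # (document id, document text)


def withhold_spam(
    docs: Iterable[Doc],
    spam_scores: Dict[str, int],
    threshold: int,
) -> Iterator[Doc]:
    """Yield only the documents that should be passed on to the indexer.

    Assumption: scores run from 0 to 100 and a LOWER score means the
    document is more likely to be spam. Documents scoring below
    `threshold` are withheld from indexing.
    """
    for doc_id, text in docs:
        # Assumption: a document with no score is treated as non-spam.
        if spam_scores.get(doc_id, 100) >= threshold:
            yield doc_id, text


# Hypothetical toy collection and scores, for illustration only.
docs = [
    ("d1", "cheap pills"),
    ("d2", "retrieval evaluation"),
    ("d3", "win money now"),
]
scores = {"d1": 5, "d2": 90, "d3": 40}

# Raising the threshold withholds more documents from the index.
for t in (0, 30, 70):
    kept = [doc_id for doc_id, _ in withhold_spam(docs, scores, t)]
    print(t, kept)
```

Because the filter runs before documents reach the indexer, a higher threshold yields a smaller index; whether precision improves at a given threshold is an empirical question of the kind the paper examines.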