In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:

• We collect and handle a large number of features based on recent advances in Web spam filtering.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all results published so far on our data set, and that computationally expensive features improve on it only slightly.
• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEB-SPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.

Our classifier ensemble improves AUC by 5% over the best Web Spam Challenge 2008 result; more importantly, the improvement is 3.5% using fewer than 100 inexpensive content features alone, and 5% if a small-vocabulary bag of words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using inexpensive content features and a small bag of words representation.
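To make the ensemble selection idea concrete, the following is a minimal sketch of greedy forward selection with replacement over a library of model predictions, scored by AUC. It is an illustration under stated assumptions, not the authors' implementation: the model names (`content_rf`, `logitboost`, `bow_model`), the toy labels and scores, and the helper names `auc` and `ensemble_selection` are all hypothetical, and the pairwise AUC is only practical for small validation sets.

```python
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    labels: Sequence[int],
    library: Dict[str, List[float]],
    max_rounds: int = 10,
) -> Tuple[List[str], float]:
    """Greedily add the model (with replacement) whose inclusion most improves
    the AUC of the averaged prediction; stop when no model improves it."""
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of the chosen models' scores
    best_auc = 0.0
    for _ in range(max_rounds):
        best_name, best_candidate = None, best_auc
        for name, preds in library.items():
            avg = [(t + p) / (len(chosen) + 1) for t, p in zip(total, preds)]
            score = auc(labels, avg)
            if score > best_candidate:  # strict improvement only
                best_name, best_candidate = name, score
        if best_name is None:
            break
        chosen.append(best_name)
        total = [t + p for t, p in zip(total, library[best_name])]
        best_auc = best_candidate
    return chosen, best_auc


# Hypothetical validation labels (1 = spam) and per-model spam scores.
labels = [1, 1, 0, 0, 1, 0]
library = {
    "content_rf": [0.9, 0.4, 0.5, 0.1, 0.8, 0.3],
    "logitboost": [0.6, 0.7, 0.2, 0.6, 0.5, 0.4],
    "bow_model": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
chosen, best = ensemble_selection(labels, library)
print(chosen, best)
```

On this toy data neither model alone separates the classes perfectly, but their average does, which is the kind of gain from combining classifiers that the abstract attributes to the learning method rather than to new features. In practice the selection would be run on a held-out validation set to limit overfitting.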