Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Making large-scale support vector machine learning practical
Advances in kernel methods
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Identifying link farm spam pages
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Topical TrustRank: using topicality to combat web spam
Proceedings of the 15th international conference on World Wide Web
Detecting spam web pages through content analysis
Proceedings of the 15th international conference on World Wide Web
Detecting semantic cloaking on the web
Proceedings of the 15th international conference on World Wide Web
Link spam detection based on mass estimation
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Detecting Link Spam Using Temporal Information
ICDM '06 Proceedings of the Sixth International Conference on Data Mining
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Splog detection using self-similarity analysis on blog temporal dynamics
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Combating web spam with trustrank
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Tracking Web spam with HTML style similarities
ACM Transactions on the Web (TWEB)
Cleaning search results using term distance features
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Latent dirichlet allocation in web spam filtering
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Robust PageRank and locally computable spam detection features
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Web spam challenge proposal for filtering in archives
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Automatic seed set expansion for trust propagation based anti-spamming algorithms
Proceedings of the eleventh international workshop on Web information and data management
Identifying spam link generators for monitoring emerging web spam
Proceedings of the 4th workshop on Information credibility
Temporal query log profiling to improve web search ranking
CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Learning to detect web spam by genetic programming
WAIM'10 Proceedings of the 11th international conference on Web-age information management
Let web spammers expose themselves
Proceedings of the fourth ACM international conference on Web search and data mining
Web spam classification: a few features worth more
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Foundations and Trends in Information Retrieval
Content-based analysis to detect Arabic web spam
Journal of Information Science
Automatic seed set expansion for trust propagation based anti-spam algorithms
Information Sciences: an International Journal
Hi-index | 0.00 |
Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.