Looking into the past to better classify web spam

Authors:
Na Dai; Brian D. Davison; Xiaoguang Qi
Affiliations:
Lehigh University, Bethlehem, PA; Lehigh University, Bethlehem, PA; Lehigh University, Bethlehem, PA
Venue:
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Year:
2009

Abstract

Web spamming techniques aim to achieve undeserved rankings in search results. Considerable research has focused on identifying such spam and neutralizing its influence. However, existing spam detection work considers only current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves the spam classification F-measure by 30% compared to a baseline classifier that considers only current page content.
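
The idea of combining a current-content classifier with a temporal-feature classifier can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two one-dimensional feature groups, the synthetic toy data, and the use of logistic regression both as the base learner and as the combiner. A real setup would also train the combiner on held-out base-classifier outputs rather than on the training set itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(model, features):
    """Spam probability under a logistic model given as (weights, bias)."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic(rows, labels, epochs=500, learning_rate=0.5):
    """Fit logistic regression with batch gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    n = len(rows)
    for _ in range(epochs):
        errors = [predict_proba((weights, bias), x) - y
                  for x, y in zip(rows, labels)]
        weights = [
            w - learning_rate * sum(e * x[j] for e, x in zip(errors, rows)) / n
            for j, w in enumerate(weights)
        ]
        bias -= learning_rate * sum(errors) / n
    return weights, bias


def train_combined(current_rows, temporal_rows, labels):
    """Train one base classifier per feature group, then a combiner on their outputs."""
    current_model = train_logistic(current_rows, labels)
    temporal_model = train_logistic(temporal_rows, labels)
    meta_rows = [
        [predict_proba(current_model, c), predict_proba(temporal_model, t)]
        for c, t in zip(current_rows, temporal_rows)
    ]
    meta_model = train_logistic(meta_rows, labels)
    return current_model, temporal_model, meta_model


def combined_spam_probability(models, current, temporal):
    """Feed both base-classifier outputs to the combiner."""
    current_model, temporal_model, meta_model = models
    base_outputs = [predict_proba(current_model, current),
                    predict_proba(temporal_model, temporal)]
    return predict_proba(meta_model, base_outputs)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall for the spam class (label 1)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic toy data, NOT drawn from WEBSPAM-UK2007. Each page has one
# hypothetical current-content feature and one hypothetical temporal feature.
CURRENT = [[0.9], [0.8], [0.4], [0.7], [0.2], [0.3], [0.5], [0.1]]
TEMPORAL = [[0.8], [0.7], [0.9], [0.6], [0.1], [0.2], [0.1], [0.3]]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = non-spam

models = train_combined(CURRENT, TEMPORAL, LABELS)
predictions = [int(combined_spam_probability(models, c, t) >= 0.5)
               for c, t in zip(CURRENT, TEMPORAL)]
print("training F-measure on toy data:", f_measure(predictions, LABELS))
```

The toy data includes a spam page whose current content looks unremarkable but whose temporal feature is high, which is the kind of case where a combiner over both classifiers could help. The printed score describes this toy data only and says nothing about the results reported in the abstract.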