Improving web spam classification using rank-time features

Authors:
Krysta M. Svore;Qiang Wu;Chris J. C. Burges;Aaswath Raman
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft, Redmond, WA
Venue:
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Year:
2007

Abstract

In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification; we note that this issue arises generally in learning problems and can be hard to detect. In particular, we find that ensuring no domains overlap between the test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
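
The domain-separation requirement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure. It assumes each example is a (url, label) pair, uses the URL host name as a simplified stand-in for "domain", and the names domain_of and domain_separated_split, the example URLs, and the labels are all hypothetical.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host name of a URL (assumption: host name approximates 'domain')."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host


def domain_separated_split(examples, test_fraction=0.2, seed=0):
    """Split (url, label) pairs so that no domain appears in both sets."""
    # Group every example under its domain.
    by_domain = defaultdict(list)
    for url, label in examples:
        by_domain[domain_of(url)].append((url, label))

    # Assign whole domains, not individual pages, to the test set.
    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])

    train, test = [], []
    for domain, items in by_domain.items():
        (test if domain in test_domains else train).extend(items)
    return train, test


# Hypothetical labeled pages: 1 = spam, 0 = non-spam.
examples = [
    ("http://a.example/page1", 1),
    ("http://a.example/page2", 1),
    ("http://b.example/index", 0),
    ("http://c.example/x", 0),
    ("http://c.example/y", 1),
    ("http://d.example/z", 0),
]
train, test = domain_separated_split(examples, test_fraction=0.25)
```

Because whole domains are assigned to one side or the other, pages from the same site can never appear in both training and test data, which is the kind of leakage the abstract reports as inflating measured precision.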