Web spam classification: a few features worth more

Authors:
Miklós Erdélyi;András Garzó;András A. Benczúr
Affiliations:
Hungarian Academy of Sciences and University of Pannonia, Veszprém;Hungarian Academy of Sciences;Hungarian Academy of Sciences
Venue:
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Year:
2011

Abstract

In this paper we investigate how much various classes of Web spam features, some of which require very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:

• We collect and handle a large number of features based on recent advances in Web spam filtering.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all results previously published on our data set and can be improved only slightly by computationally expensive features.
• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEBSPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.

Our classifier ensemble improves AUC by 5% over the best result of the Web Spam Challenge 2008; more importantly, the improvement is 3.5% using fewer than 100 inexpensive content features alone, and 5% if a small-vocabulary bag of words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using only inexpensive content features and a small bag of words representation.
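The abstract names ensemble selection and AUC without describing them, so a minimal sketch may help. It shows greedy forward ensemble selection with replacement, scored by AUC on a validation set. This is not the authors' implementation. The model names ("content", "links", "noise"), the scores and the labels are made-up toy values, and the stopping rule (stop when AUC no longer improves) is one common variant chosen here for brevity.

```python
from typing import Dict, List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly,
    counting ties as one half (the Mann-Whitney statistic)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    library: Dict[str, List[float]],
    labels: Sequence[int],
    max_rounds: int = 10,
) -> Tuple[List[str], float]:
    """Greedily add (with replacement) the model whose inclusion gives the
    highest validation AUC; stop when no candidate improves the ensemble."""
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of the selected models' scores
    current = -1.0               # AUC of the current ensemble (none yet)
    for _ in range(max_rounds):
        best_name, best_score = None, -1.0
        for name in sorted(library):
            # AUC depends only on ranking, so summing is equivalent to averaging.
            candidate = [t + p for t, p in zip(total, library[name])]
            score = auc(labels, candidate)
            if score > best_score:
                best_name, best_score = name, score
        if best_name is None or best_score <= current:
            break
        chosen.append(best_name)
        total = [t + p for t, p in zip(total, library[best_name])]
        current = best_score
    return chosen, current


# Hypothetical validation data: 1 = spam host, 0 = non-spam host.
labels = [1, 1, 1, 0, 0, 0]
library = {
    "content": [0.9, 0.4, 0.8, 0.5, 0.2, 0.1],
    "links": [0.3, 0.9, 0.7, 0.2, 0.6, 0.1],
    "noise": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
}
chosen, final_auc = ensemble_selection(library, labels)
print(chosen, final_auc)
```

In this toy library each of the two informative models misranks one pair on its own, while their sum ranks every spam host above every non-spam host, so the selection stops after combining them. Real use would draw the library from many trained classifiers and report AUC on held-out data, not on the set used for selection.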