Spam detection in online classified advertisements

Authors:
Hung Tran;Thomas Hornbeck;Viet Ha-Thuc;James Cremer;Padmini Srinivasan
Affiliations:
University of Iowa, Iowa City, IA;University of Iowa, Iowa City, IA;University of Iowa, Iowa City, IA;University of Iowa, Iowa City, IA;University of Iowa, Iowa City, IA
Venue:
Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Year:
2011

Abstract

Online classified advertisements have become an essential part of the advertising market. Popular classified advertisement sites such as Craigslist, eBay Classifieds, and Oodle attract a huge number of posts and visits. Because of its high commercial potential, the online classified advertisement domain is a target for spammers, and spam has become one of the biggest issues hindering the further development of online advertising. Spam detection in online advertisements is therefore a crucial problem. However, previous approaches to Web spam detection in other domains do not work well in the advertisement domain. We propose a novel spam detection approach that takes into account the particular characteristics of this domain. Specifically, we propose a novel set of features that can strongly discriminate between spam and legitimate advertisement posts. Our experiments on a dataset derived from Craigslist advertisements demonstrate the effectiveness of our approach. In particular, the approach improves the F1 score by 55% over a baseline that uses traditional features alone.
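
For readers unfamiliar with the metric, the following is a minimal sketch of how an F1 score and an improvement over a baseline could be computed. It is not the authors' evaluation code. The function names are hypothetical, and the sketch assumes the reported 55% is a relative improvement, which the abstract does not state.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are counts of true positives, false positives and
    false negatives for the spam class.
    """
    if tp == 0:
        # No correct spam predictions: precision and recall are 0 or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Return the fractional gain of `new` over `baseline`.

    Assumption: improvement is measured relative to the baseline score.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline
```

Under this reading, a classifier would need an F1 score 1.55 times the baseline's to match the reported gain; an absolute reading (a difference of 0.55 in F1) would give a different comparison.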