Automatically generated spam detection based on sentence-level topic information

Authors:
Yoshihiko Suhara;Hiroyuki Toda;Shuichi Nishioka;Seiji Susaki
Affiliations:
NTT Service Evolution Laboratories, NTT Corporation, Yokosuka-shi, Kanagawa, Japan (all four authors)
Venue:
Proceedings of the 22nd International Conference on World Wide Web Companion
Year:
2013

Abstract

Spammers use a wide range of content generation techniques to produce low-quality pages, known as content spam, to achieve their goals. We argue that content spam must be tackled using a wide range of content quality features. In this paper, we propose novel sentence-level diversity features based on a probabilistic topic model. We combine them with other content features to build a content spam classifier. Our experiments show that our method outperforms conventional methods.
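
The abstract does not define the sentence-level diversity features, so the following is only an illustrative sketch of what such a feature could look like, not the authors' method. It assumes each sentence of a page has already been mapped to a topic distribution (for example by a topic model such as LDA), and it uses the mean pairwise Jensen-Shannon divergence between sentences as a hypothetical diversity score. The function names and the sample distributions are invented for illustration.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def sentence_topic_diversity(sentence_topics):
    """Hypothetical feature: mean pairwise JS divergence over all
    sentence pairs of one page. Pages with fewer than two sentences
    get a score of 0.0."""
    pairs = list(combinations(sentence_topics, 2))
    if not pairs:
        return 0.0
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Made-up per-sentence topic distributions over three topics.
# (Assumption: in practice these would come from a trained topic model.)
coherent_page = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]]
stitched_page = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(sentence_topic_diversity(coherent_page))
print(sentence_topic_diversity(stitched_page))
```

Under these assumptions, a page whose sentences jump between unrelated topics scores higher than a page whose sentences share one dominant topic. Such a score could then be appended to other content features as one input to a classifier.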