Unsupervised Spam Detection by Document Complexity Estimation

Authors:
Takashi Uemura;Daisuke Ikeda;Hiroki Arimura
Affiliations:
Hokkaido University, Sapporo, Japan 060-0814;Kyushu University, Fukuoka, Japan 819-0395;Hokkaido University, Sapporo, Japan 060-0814
Venue:
DS '08 Proceedings of the 11th International Conference on Discovery Science
Year:
2008

Abstract

In this paper, we study content-based spam detection for a specific type of spam, called blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results showed that our algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
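
To make the idea of an entropy-like, unsupervised score concrete, the following is a minimal sketch and not the DCE algorithm itself. The abstract does not give the definition of document complexity, so the sketch swaps in a stand-in: the average number of bits per character of a document under a smoothed character n-gram model built from all the other documents. It also uses plain dictionaries of counts rather than suffix trees, so it does not reproduce the linear-time bound. The names `complexity` and `leave_one_out_scores`, the parameter `n`, and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def complexity(doc, others, n=3):
    """Average bits per character of `doc` under a character n-gram model
    estimated from `others`. Illustrative stand-in, not the paper's measure."""
    if not doc:
        return 0.0
    ctx_counts = Counter()   # counts of (n-1)-character contexts
    gram_counts = Counter()  # counts of full n-grams
    alphabet = set(doc)
    for text in others:
        alphabet.update(text)
        padded = " " * (n - 1) + text
        for i in range(len(text)):
            ctx_counts[padded[i:i + n - 1]] += 1
            gram_counts[padded[i:i + n]] += 1
    v = len(alphabet)
    padded = " " * (n - 1) + doc
    bits = 0.0
    for i in range(len(doc)):
        ctx = padded[i:i + n - 1]
        # Add-one smoothing (an assumption) keeps unseen n-grams finite.
        p = (gram_counts[padded[i:i + n]] + 1) / (ctx_counts[ctx] + v)
        bits -= math.log2(p)
    return bits / len(doc)


def leave_one_out_scores(docs, n=3):
    """Score every document against the rest of the collection; no labels needed."""
    return [complexity(d, docs[:i] + docs[i + 1:], n) for i, d in enumerate(docs)]
```

Under the assumption that mass-produced postings are more predictable from the rest of the collection than ordinary ones, documents with unusually low scores would be the candidates to inspect; how such scores are thresholded is not stated in the abstract and is left open here.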