In this paper, we study content-based spam detection for a specific type of spam, called blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
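The abstract does not give the definition of the document complexity, so the following is only a rough sketch of what an entropy-like, unsupervised score over a document collection could look like. It is a stand-in, not DCE itself: it assumes a leave-one-out cross-entropy over character n-grams with add-one smoothing, and it uses plain counters rather than suffix trees, so it does not reproduce the paper's definition or its linear-time bound. The names `ngrams`, `complexity_scores`, and the parameter `n` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def complexity_scores(docs, n=3):
    """Leave-one-out cross-entropy (bits per n-gram) of each document.

    ASSUMPTION: this is an illustrative stand-in for the paper's
    document complexity, not its actual definition.
    """
    per_doc = [Counter(ngrams(d, n)) for d in docs]

    # Pool n-gram counts over the whole collection once.
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    vocab = len(total)
    grand = sum(total.values())

    scores = []
    for counts in per_doc:
        m = sum(counts.values())
        if m == 0:
            # Document shorter than n characters: nothing to score.
            scores.append(0.0)
            continue
        rest_size = grand - m  # n-gram occurrences in the other documents
        bits = 0.0
        for gram, k in counts.items():
            # Add-one smoothed probability of gram in the other documents.
            p = (total[gram] - k + 1) / (rest_size + vocab)
            bits -= k * math.log2(p)
        scores.append(bits / m)
    return scores


if __name__ == "__main__":
    docs = [
        "buy cheap pills now",
        "buy cheap pills now",
        "the suffix tree is built in linear time",
    ]
    for doc, score in zip(docs, complexity_scores(docs)):
        print(f"{score:.3f}  {doc}")
```

In this toy collection the two identical documents receive the same score, and that score is lower than the score of the unrelated third document, because each copy is well predicted by the other. How such scores are turned into a spam decision is not specified in the abstract and is left open here.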