Unsupervised spam detection based on string alienness measures

Authors:
Kazuyuki Narisawa;Hideo Bannai;Kohei Hatano;Masayuki Takeda
Affiliations:
Department of Informatics, Kyushu University, Fukuoka, Japan (all authors)
Venue:
DS'07: Proceedings of the 10th International Conference on Discovery Science
Year:
2007

Abstract

We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness of substrings within the documents, i.e., how different they are from other substrings. A document is then classified as spam if it contains a substring that belongs to an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments on data collected from Japanese web forums show that the method successfully discovers spam documents.
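
The classification rule in the abstract can be illustrated with a minimal sketch. The abstract does not define the equivalence relation or the three alienness measures, so everything specific here is an assumption made for illustration: substrings are grouped when they share exactly the same set of start positions, the score `toy_alienness` (occurrence count times the length of the longest class member) is a hypothetical stand-in and not one of the paper's measures, and the function names, length bounds, and threshold are invented. The naive enumeration is also not scalable; the paper's own approach presumably relies on suffix-array techniques instead.

```python
from collections import defaultdict


def substring_classes(docs, min_len=3, max_len=12):
    """Group substrings that share exactly the same set of start positions.

    This grouping is an assumed, simplified equivalence relation used only
    for illustration. Returns a dict: frozenset of (doc_id, start) -> members.
    """
    occurrences = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for start in range(len(text)):
            for length in range(min_len, max_len + 1):
                if start + length > len(text):
                    break
                occurrences[text[start:start + length]].add((doc_id, start))

    classes = defaultdict(list)
    for substring, positions in occurrences.items():
        classes[frozenset(positions)].append(substring)
    return classes


def toy_alienness(positions, members):
    """Hypothetical stand-in score, NOT one of the paper's three measures.

    Long substrings that repeat verbatim score high; substrings that occur
    only once score zero.
    """
    if len(positions) < 2:
        return 0
    return len(positions) * max(len(m) for m in members)


def detect_spam(docs, threshold, min_len=3, max_len=12):
    """Flag every document containing a substring from a high-scoring class."""
    flagged = set()
    for positions, members in substring_classes(docs, min_len, max_len).items():
        if toy_alienness(positions, members) >= threshold:
            flagged.update(doc_id for doc_id, _ in positions)
    return sorted(flagged)


docs = [
    "hello world today",
    "buy cheap pills now",
    "nice weather we have",
    "buy cheap pills now",
]
print(detect_spam(docs, threshold=20))  # [1, 3]
```

No labels are used at any point, which is what makes the rule unsupervised, and the code operates on raw characters without tokenization, in line with the language independence claimed in the abstract.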