Unsupervised spam detection based on string alienness measures

  • Authors:
  • Kazuyuki Narisawa; Hideo Bannai; Kohei Hatano; Masayuki Takeda

  • Affiliations:
  • Department of Informatics, Kyushu University, Fukuoka, Japan (all authors)

  • Venue:
  • DS'07 Proceedings of the 10th international conference on Discovery science
  • Year:
  • 2007

Abstract

We propose an unsupervised method for detecting spam documents in a given set of documents, based on equivalence relations on strings. We give three measures for quantifying the alienness of substrings within the documents, i.e., how different they are from the others. A document is then classified as spam if it contains a substring belonging to an equivalence class with a high degree of alienness. The proposed method is unsupervised, language independent, and scalable. Computational experiments on data collected from Japanese web forums show that the method successfully discovers spam.
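
To make the overall pipeline concrete, the following is a minimal Python sketch of the general idea only: group substrings into equivalence classes, score each class, and flag every document that contains a high-scoring class. It is not the authors' algorithm. The abstract does not define the equivalence relation or the three alienness measures, so everything here is an assumption made for illustration: the relation used (two substrings are equivalent when they occur in exactly the same set of documents), the score `toy_alienness` (length of the longest member, counted only for classes shared by at least two documents), and the parameters `min_len` and `threshold` are all hypothetical stand-ins.

```python
from collections import defaultdict


def substring_classes(docs, min_len=4):
    """Group substrings by the exact set of documents they occur in.

    Assumption: this toy relation stands in for the paper's
    equivalence relation, which the abstract does not spell out.
    """
    # Map each substring of length >= min_len to the documents containing it.
    occurrences = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for start in range(len(doc)):
            for end in range(start + min_len, len(doc) + 1):
                occurrences[doc[start:end]].add(doc_id)

    # Substrings with identical document sets fall into one class.
    classes = defaultdict(list)
    for substring, doc_ids in occurrences.items():
        classes[frozenset(doc_ids)].append(substring)
    return classes


def toy_alienness(doc_ids, members):
    """Hypothetical score, not one of the paper's three measures.

    Returns the length of the longest substring in the class, or 0
    when the class is confined to a single document.
    """
    if len(doc_ids) < 2:
        return 0
    return max(len(s) for s in members)


def flag_spam(docs, threshold=12, min_len=4):
    """Return sorted indices of documents containing a high-scoring class."""
    flagged = set()
    for doc_ids, members in substring_classes(docs, min_len).items():
        if toy_alienness(doc_ids, members) >= threshold:
            flagged |= doc_ids
    return sorted(flagged)


docs = [
    "see you at the station at noon",
    "BUY CHEAP WATCHES NOW at example.test",
    "thanks for the recipe, it was great",
    "hello all! BUY CHEAP WATCHES NOW today",
]
print(flag_spam(docs))  # -> [1, 3]
```

The sketch needs no labels and no tokenizer, which mirrors the unsupervised and language-independent properties claimed above. It does not mirror the scalability claim: enumerating every substring is quadratic in document length, so it is suitable only for small inputs.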