Unsupervised Spam Detection by Document Complexity Estimation

  • Authors:
  • Takashi Uemura;Daisuke Ikeda;Hiroki Arimura

  • Affiliations:
  • Hokkaido University, Sapporo, Japan 060-0814;Kyushu University, Fukuoka, Japan 819-0395;Hokkaido University, Sapporo, Japan 060-0814

  • Venue:
  • DS '08 Proceedings of the 11th International Conference on Discovery Science
  • Year:
  • 2008

Abstract

In this paper, we study content-based spam detection for a specific type of spam, namely blog and bulletin board spam. We develop an efficient unsupervised algorithm, DCE, that detects spam documents in a mixture of spam and non-spam documents using an entropy-like measure called the document complexity. Using suffix trees, the algorithm computes the document complexity of all documents in time linear in the total length of the input documents. Experimental results show that the algorithm works especially well for detecting word salad spam, which is believed to be difficult to detect automatically.
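
The abstract does not define the document complexity, so the following is only a rough sketch of the general idea of an entropy-like, unsupervised score, not the DCE algorithm itself. It assumes a simplified stand-in: fixed-length character n-grams with add-one smoothing instead of suffix trees, and a leave-one-out estimate in which each document is scored against the rest of the collection. The names `ngram_counts` and `complexity_scores` and the parameter `n` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Count all character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def complexity_scores(docs, n=3):
    """Return an entropy-like score (bits per n-gram) for each document.

    Assumption: this is a simplified stand-in for the paper's measure.
    Each document is scored against n-gram counts taken from all OTHER
    documents (leave-one-out), with add-one smoothing. A low score means
    the document is highly predictable from the rest of the collection.
    """
    per_doc = [ngram_counts(d, n) for d in docs]
    total = Counter()
    for counts in per_doc:
        total.update(counts)
    vocab = len(total)                # number of distinct n-grams overall
    total_sum = sum(total.values())   # number of n-gram occurrences overall

    scores = []
    for counts in per_doc:
        own = sum(counts.values())
        if own == 0:                  # document shorter than n characters
            scores.append(0.0)
            continue
        rest_sum = total_sum - own
        bits = 0.0
        for gram, k in counts.items():
            rest = total[gram] - k    # occurrences outside this document
            p = (rest + 1) / (rest_sum + vocab)
            bits -= k * math.log2(p)
        scores.append(bits / own)
    return scores


docs = [
    "buy cheap pills now buy cheap pills now",
    "buy cheap pills now buy cheap pills today",
    "the suffix tree was built in linear time",
]
scores = complexity_scores(docs)
```

In this toy collection the two near-duplicate documents receive lower scores than the third, because most of their n-grams also occur elsewhere in the collection. Whether low or high scores should be flagged as spam, and at what threshold, is a design choice this sketch leaves open.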