Combining Winnow and orthogonal sparse bigrams for incremental spam filtering

  • Authors:
  • Christian Siefkes;Fidelis Assis;Shalendra Chhabra;William S. Yerazunis

  • Affiliations:
  • Freie Universität Berlin, Berlin, Germany;Empresa Brasileira de Telecomunicações - Embratel, Rio de Janeiro, RJ, Brazil;University of California, Riverside, California;Mitsubishi Electric Research Laboratories, Cambridge, MA

  • Venue:
  • PKDD '04 Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases
  • Year:
  • 2004

Abstract

Spam filtering is a text categorization task that has attracted significant attention due to the rapidly growing volume of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both of these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques, we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.
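
The abstract gives no implementation details, so the following is only a rough sketch of how the two ideas might fit together. It assumes that OSB pairs each token with each of the few tokens preceding it inside a sliding window, marking the skipped positions, and it uses a textbook Winnow with multiplicative updates on misclassified messages. The window size, the `<skip>` marker, the promotion and demotion factors, the threshold, and all function and class names are illustrative assumptions, not values taken from this record.

```python
SKIP = "<skip>"  # hypothetical marker for a skipped token position


def osb_features(tokens, window=5):
    """Pair each token with each of the up to window-1 tokens before it.

    Skipped positions between the two tokens are replaced by SKIP, so the
    distance between the pair is part of the feature (assumed OSB layout).
    """
    features = []
    for i, current in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # no more tokens to the left
            gap = [SKIP] * (distance - 1)
            features.append(" ".join([tokens[j], *gap, current]))
    return features


class Winnow:
    """Textbook Winnow over sparse features; all parameters are illustrative."""

    def __init__(self, promotion=1.23, demotion=0.83):
        self.promotion = promotion
        self.demotion = demotion
        self.weights = {}  # features not seen yet implicitly have weight 1.0

    def score(self, features):
        # Average weight of the distinct active features.
        active = set(features)
        if not active:
            return 0.0
        return sum(self.weights.get(f, 1.0) for f in active) / len(active)

    def predict(self, features):
        # Classify as spam when the average weight exceeds the threshold 1.0.
        return self.score(features) > 1.0

    def train(self, features, is_spam):
        # Mistake-driven: weights change only when the prediction is wrong.
        if self.predict(features) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in set(features):
            self.weights[f] = self.weights.get(f, 1.0) * factor


clf = Winnow()
spam = osb_features("buy cheap pills now".split())
clf.train(spam, is_spam=True)  # one incremental update on a single message
```

Because weights change only on mistakes and each message is processed once, a learner of this shape can be updated one message at a time, which is the sense in which the sketch is incremental.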