A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists

  • Authors:
  • Georgios Sakkis; Ion Androutsopoulos; Georgios Paliouras; Vangelis Karkaletsis; Constantine D. Spyropoulos; Panagiotis Stamatopoulos

  • Affiliations:
  • Institute of Informatics and Telecommunications, National Centre for Scientific Research (NCSR) “Demokritos”, GR-153 10 Ag. Paraskevi, Athens, Greece. gsakis@iit.demokritos.gr ...; Department of Informatics, Athens University of Economics and Business, Patission 76, GR-104 34, Athens, Greece. ion@aueb.gr; Institute of Informatics and Telecommunications, National Centre for Scientific Research (NCSR) “Demokritos”, GR-153 10 Ag. Paraskevi, Athens, Greece. paliourg@iit.demokritos.g ...; Institute of Informatics and Telecommunications, National Centre for Scientific Research (NCSR) “Demokritos”, GR-153 10 Ag. Paraskevi, Athens, Greece. vangelis@iit.demokritos.g ...; Institute of Informatics and Telecommunications, National Centre for Scientific Research (NCSR) “Demokritos”, GR-153 10 Ag. Paraskevi, Athens, Greece. costass@iit.demokritos.gr ...; Department of Informatics, University of Athens, TYPA Buildings, Panepistimiopolis, GR-157 71, Athens, Greece. T.Stamatopoulos@di.uoa.gr

  • Venue:
  • Information Retrieval
  • Year:
  • 2003

Abstract

This paper presents an extensive empirical evaluation of memory-based learning in the context of anti-spam filtering, a novel cost-sensitive application of text categorization that attempts to automatically identify the unsolicited commercial messages that flood mailboxes. Focusing on anti-spam filtering for mailing lists, we thoroughly investigate the effectiveness of a memory-based anti-spam filter using a publicly available corpus. The investigation covers different attribute-weighting and distance-weighting schemes, and studies the effect of the neighborhood size, the size of the attribute set, and the size of the training corpus. Three different cost scenarios are identified, and suitable cost-sensitive evaluation functions are employed. We conclude that memory-based anti-spam filtering for mailing lists is practically feasible, especially when combined with additional safety nets. Compared to a previously tested Naive Bayes filter, the memory-based filter performs better on average, particularly when the misclassification cost for non-spam messages is high.
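
To make the idea of a cost-sensitive, memory-based filter concrete, the following is a minimal sketch and not the authors' implementation. It assumes binary attribute vectors, a plain overlap (Hamming) distance, inverse-distance vote weighting, and a decision threshold of `lam / (1 + lam)`, where the hypothetical parameter `lam` expresses how many times more costly it is to block a legitimate message than to let a spam message through. The function names, the toy data, and the `lam` values are illustrative assumptions; none of them come from the abstract.

```python
from collections import defaultdict


def hamming_distance(a, b):
    """Count the positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_spam_score(query, training, k=3):
    """Distance-weighted share of spam votes among the k nearest stored messages.

    training is a list of (attribute_vector, label) pairs, with label
    either "spam" or "legit". Assumption: ties are broken by list order.
    """
    neighbours = sorted(training, key=lambda ex: hamming_distance(query, ex[0]))[:k]
    votes = defaultdict(float)
    for vector, label in neighbours:
        # Assumed weighting scheme: closer neighbours get larger votes.
        votes[label] += 1.0 / (1.0 + hamming_distance(query, vector))
    return votes["spam"] / sum(votes.values())


def classify(query, training, k=3, lam=1.0):
    """Flag a message as spam only if its spam score exceeds lam / (1 + lam)."""
    threshold = lam / (1.0 + lam)
    return "spam" if knn_spam_score(query, training, k) > threshold else "legit"


# Toy data (invented for illustration): each vector marks the presence of 4 attributes.
training = [
    ([1, 1, 0, 0], "spam"),
    ([1, 1, 1, 0], "spam"),
    ([0, 0, 1, 1], "legit"),
    ([0, 0, 0, 1], "legit"),
    ([0, 1, 1, 1], "legit"),
]
query = [1, 1, 0, 1]

print(classify(query, training, k=3, lam=1.0))
print(classify(query, training, k=3, lam=9.0))
```

In this sketch, raising `lam` raises the threshold, so the same message can be flagged under a low-cost scenario and passed under a high-cost one. This mirrors, in a simplified way, how a cost-sensitive setting makes a filter more conservative about blocking non-spam messages.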