Tokenising, stemming and stopword removal on anti-spam filtering domain

  • Authors:
  • J. R. Méndez;E. L. Iglesias;F. Fdez-Riverola;F. Díaz;J. M. Corchado

  • Affiliations:
  • Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Segovia, Spain;Dept. Informática y Automática, University of Salamanca, Salamanca, Spain

  • Venue:
  • CAEPIA'05 Proceedings of the 11th Spanish association conference on Current Topics in Artificial Intelligence
  • Year:
  • 2005

Abstract

Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. Nevertheless, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification may not be as effective in e-mail filtering. This is demonstrated here through a comparative study of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based techniques for spam filtering (machine learning approaches and case-based systems). Sound conclusions are drawn from experiments covering several different scenarios.
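
To make the three preprocessing steps named in the abstract concrete, the following is a minimal sketch of how they might be chained and switched on or off for comparison. It is not the authors' implementation: the stopword list, the delimiter patterns, the function names and the toy suffix stripper (which stands in for a real stemmer such as Porter's) are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much larger.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "for", "you", "your"}


def tokenise(text, delimiters=r"[^a-z0-9]+"):
    """Lowercase the text and split it on a configurable delimiter pattern."""
    return [t for t in re.split(delimiters, text.lower()) if t]


def remove_stopwords(tokens):
    """Drop tokens that appear in the (assumed) stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Toy suffix stripper used as a stand-in; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, use_stopwords=True, use_stemming=True,
               delimiters=r"[^a-z0-9]+"):
    """Apply tokenising, then optional stopword removal and stemming."""
    tokens = tokenise(text, delimiters)
    if use_stopwords:
        tokens = remove_stopwords(tokens)
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


if __name__ == "__main__":
    message = "Claim your FREE prizes now!!!"
    # Scheme 1: split on non-alphanumerics, with stopword removal and stemming.
    print(preprocess(message))
    # Scheme 2: split on whitespace only, no stopword removal or stemming.
    print(preprocess(message, use_stopwords=False, use_stemming=False,
                     delimiters=r"\s+"))
```

The second scheme keeps the token `now!!!` intact, punctuation included. This hints at why the choice of tokenising scheme can matter for spam: aggressive splitting or stemming may discard surface features that a filter could otherwise exploit.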