Tokenising, stemming and stopword removal on anti-spam filtering domain

Authors:
J. R. Méndez;E. L. Iglesias;F. Fdez-Riverola;F. Díaz;J. M. Corchado
Affiliations:
Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain;Dept. Informática, University of Valladolid, Escuela Universitaria de Informática, Segovia, Spain;Dept. Informática y Automática, University of Salamanca, Salamanca, Spain
Venue:
CAEPIA'05 Proceedings of the 11th Spanish association conference on Current Topics in Artificial Intelligence
Year:
2005



Abstract

Junk e-mail detection and filtering can be considered a cost-sensitive classification problem. Nevertheless, the preprocessing methods and noise reduction strategies used to improve computational efficiency in text classification may not be as effective in e-mail filtering. This is demonstrated here through a comparative study of stopword removal, stemming and different tokenising schemes. The final goal is to preprocess the training e-mail corpora of several content-based spam filtering techniques (machine learning approaches and case-based systems). Sound conclusions are drawn from experiments that consider different scenarios.
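
To make the three preprocessing steps named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The stopword list, the two tokenising schemes, the suffix-stripping rule and every function name are assumptions chosen for illustration; in particular, `naive_stem` is a crude stand-in for a real stemmer such as the Porter algorithm.

```python
import re

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "you", "your"}


def tokenise_whitespace(text):
    """Scheme 1 (hypothetical): split on blanks only, so 'now!!!' stays one token."""
    return text.lower().split()


def tokenise_alnum(text):
    """Scheme 2 (hypothetical): keep runs of letters and digits, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Crude suffix stripping that keeps at least three characters of the token."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, tokenise=tokenise_alnum, stopwords=True, stemming=True):
    """Apply the chosen tokeniser, then the optional stopword and stemming steps."""
    tokens = tokenise(text)
    if stopwords:
        tokens = remove_stopwords(tokens)
    if stemming:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


message = "Claim your FREE prizes now!!! Winning is guaranteed."
full = preprocess(message)
raw = preprocess(message, tokenise=tokenise_whitespace, stopwords=False, stemming=False)
print(full)
print(raw)
```

Comparing `full` with `raw` shows what each step discards: the whitespace scheme preserves tokens such as `now!!!`, whose punctuation may itself be a useful spam signal, while the full pipeline normalises them away. Trade-offs of this kind are what a comparative study of preprocessing schemes would need to measure.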