A comparative performance study of feature selection methods for the anti-spam filtering domain

Authors:
J. R. Méndez;F. Fdez-Riverola;F. Díaz;E. L. Iglesias;J. M. Corchado
Affiliations:
Dept. Informática, University of Vigo, Ourense, Spain;Dept. Informática, University of Vigo, Ourense, Spain;Dept. Informática, University of Valladolid, Segovia, Spain;Dept. Informática, University of Vigo, Ourense, Spain;Dept. Informática y Automática, University of Salamanca, Salamanca, Spain
Venue:
ICDM'06 Proceedings of the 6th Industrial Conference on Data Mining conference on Advances in Data Mining: applications in Medicine, Web Mining, Marketing, Image and Signal Mining
Year:
2006

Abstract

In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ² test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind the feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
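
As a rough illustration of the four feature selection methods named in the abstract, the sketch below scores each term from a 2x2 term/class contingency table and keeps the top-ranked terms. It is not the authors' implementation: the function names (`entropy`, `term_scores`, `select_top_k`), the toy corpus and the choice of "spam" as the positive class are assumptions made for this example, and Mutual Information is taken here in its pointwise form for the (term, spam) pair, which is one common formulation in text categorization.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(k / total * math.log2(k / total) for k in counts if k > 0)

def term_scores(docs, labels, term, positive="spam"):
    """Score one term against a binary class (hypothetical helper).

    a: positive docs containing the term   b: negative docs containing it
    c: positive docs without the term      d: negative docs without it
    """
    n = len(docs)
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        is_positive = label == positive
        if present and is_positive:
            a += 1
        elif present:
            b += 1
        elif is_positive:
            c += 1
        else:
            d += 1

    # Document Frequency: number of documents containing the term.
    df = a + b

    # Chi-squared statistic of the 2x2 table (0 if a margin is empty).
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi2 = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    # Pointwise mutual information between the term and the positive class.
    mi = math.log2(a * n / ((a + b) * (a + c))) if a else float("-inf")

    # Information Gain: class entropy minus entropy conditioned on the term.
    conditional = (a + b) / n * entropy([a, b]) + (c + d) / n * entropy([c, d])
    ig = entropy([a + c, b + d]) - conditional

    return {"df": df, "chi2": chi2, "mi": mi, "ig": ig}

def select_top_k(docs, labels, k, score="ig"):
    """Return the k best terms under the chosen score, ties broken by name."""
    vocabulary = set().union(*docs)
    ranked = sorted(
        vocabulary,
        key=lambda t: (-term_scores(docs, labels, t)[score], t),
    )
    return ranked[:k]

# Toy corpus (invented for illustration): each message is a set of tokens.
DOCS = [
    {"free", "money"},
    {"free", "offer"},
    {"meeting", "agenda"},
    {"project", "meeting"},
]
LABELS = ["spam", "spam", "ham", "ham"]

if __name__ == "__main__":
    print(term_scores(DOCS, LABELS, "free"))
    print(select_top_k(DOCS, LABELS, k=2, score="ig"))
```

The four scores can rank terms differently. In this toy corpus "meeting" is as informative as "free" under Information Gain and χ², since it perfectly predicts legitimate mail, while pointwise Mutual Information with the spam class gives it the lowest possible score.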