In this paper we analyse the strengths and weaknesses of the most widely used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. The Information Gain, χ² test, Mutual Information and Document Frequency feature selection methods are analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From these experiments, the underlying ideas behind the feature selection methods are identified and applied to improve the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.
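As a rough illustration of two of the feature selection methods named above, the following sketch scores terms by Document Frequency and Information Gain on a toy corpus. It is not the experimental setup of the paper: the function names (entropy, score_terms), the four example messages and their labels are all hypothetical, and each message is simplified to a set of tokens.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def score_terms(docs, labels):
    """Return {term: (document_frequency, information_gain)}.

    docs is a list of token sets, labels the matching list of class names.
    Both names are illustrative and not taken from the paper.
    """
    n = len(docs)
    base = entropy(list(Counter(labels).values()))
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        # Class counts among messages that contain / do not contain the term.
        present = Counter(l for d, l in zip(docs, labels) if term in d)
        absent = Counter(l for d, l in zip(docs, labels) if term not in d)
        df = sum(present.values())
        # Conditional entropy of the class given presence or absence of the term.
        cond = 0.0
        for part in (present, absent):
            size = sum(part.values())
            if size:
                cond += size / n * entropy(list(part.values()))
        scores[term] = (df, base - cond)
    return scores


# Toy corpus (made up for illustration only).
docs = [
    {"free", "viagra", "offer"},
    {"free", "meeting"},
    {"meeting", "agenda"},
    {"viagra", "offer"},
]
labels = ["spam", "ham", "ham", "spam"]

scores = score_terms(docs, labels)
# Rank terms by information gain, breaking ties by document frequency.
ranked = sorted(scores, key=lambda t: (scores[t][1], scores[t][0]), reverse=True)
```

In this toy corpus "free" and "viagra" have the same document frequency, yet only "viagra" separates the two classes, so the two criteria can rank the same terms quite differently.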