Applying effective feature selection techniques with hierarchical mixtures of experts for spam classification

Authors:
Petros Belsis;Kostas Fragos;Stefanos Gritzalis;Christos Skourlas
Affiliations:
(Correspd. Tel.: +30 22730 82234/ Fax: +30 22730 82009/ E-mail: pbelsis@aegean.gr) Dept. of Info. and Comm. Sys. Eng., Univ. of the Aegean, Samos, 83200 Greece and Dept. of Informatics, Technologi ...;Department of Electrical and Computer Engineering, National Technical University of Athens, Athens, 15771 Greece;Department of Information and Communication Systems Engineering, University of the Aegean, Samos, 83200 Greece;Department of Informatics, Technological Education Institute of Athens, Egaleo, 12210 Greece
Venue:
Journal of Computer Security - Best papers of the Sec Track at the 2006 ACM Symposium
Year:
2009

Abstract

E-mail abuse has increased steadily over the last decade. E-mail users are targeted by massive quantities of unsolicited bulk e-mail, which often contains offensive language or has fraudulent intent. Internet Service Providers (ISPs), for their part, face considerable system overload as the incoming mail consumes network and storage resources. Among the many proposed solutions, text filtering approaches are the most prominent in terms of cost efficiency and complexity. Most of these approaches model the problem using linear statistical models. Despite the popularity of such models, which stems from both their simplicity and their relative ease of interpretation, the linearity assumption they impose on the data samples is inappropriate in practice, mainly because linear models cannot capture the apparent non-linear relationships that characterize these samples. In this paper, we propose a margin-based feature selection approach integrated with a Hierarchical Mixtures of Experts (HME) system, which attempts to overcome limitations common to other machine-learning-based approaches. After reducing the data dimensionality with effective feature selection algorithms, we evaluated our system on publicly available e-mail corpora characterized by very high similarity between legitimate and bulk e-mail (and thus low discriminative potential). We experimented with two different architectures, a hierarchical HME and a perceptron HME. Our results confirm that our Spam Filtering (SF) - HME method outperforms other machine learning approaches, which achieve lower recall, as well as traditional rule-based approaches, which fall considerably short in precision.
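To make the HME idea in the abstract concrete, the following is a minimal sketch of the forward pass of a two-level hierarchical mixture of experts for a binary spam/legitimate decision. It is an illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `sigmoid`, `hme_predict`), the use of logistic-regression experts, the two-level depth, and all weight values are hypothetical, and training (for example with EM) is not shown.

```python
import math


def softmax(scores):
    """Turn raw gate scores into mixing weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def hme_predict(x, top_gate, branches):
    """Return P(spam | x) for a two-level hierarchical mixture of experts.

    x        -- feature vector (assumed to be the features kept after selection)
    top_gate -- one weight vector per branch (top-level gating network)
    branches -- list of (gate_weights, expert_weights) pairs; each branch has
                its own gating network and its own logistic experts
    """
    top = softmax([dot(v, x) for v in top_gate])
    p = 0.0
    for g_i, (gate_w, expert_w) in zip(top, branches):
        inner = softmax([dot(v, x) for v in gate_w])
        # Each expert is assumed to be a logistic regression on x.
        branch_p = sum(g_ij * sigmoid(dot(w, x))
                       for g_ij, w in zip(inner, expert_w))
        p += g_i * branch_p
    return p


# Hypothetical example: a bias term plus two selected feature values.
x = [1.0, 2.0, 0.5]
top_gate = [[0.1, 0.4, -0.3], [-0.2, 0.0, 0.5]]
branches = [
    ([[0.0, 0.3, 0.1], [0.2, -0.1, 0.0]],    # gate weights, branch 1
     [[-1.0, 0.8, 0.2], [0.5, -0.4, 0.9]]),  # expert weights, branch 1
    ([[0.1, 0.1, 0.1], [-0.3, 0.2, 0.4]],    # gate weights, branch 2
     [[0.3, 0.3, -0.6], [-0.7, 1.1, 0.0]]),  # expert weights, branch 2
]
p_spam = hme_predict(x, top_gate, branches)
print(f"P(spam | x) = {p_spam:.3f}")
```

Because the gating weights at each level are softmax outputs, the final prediction is a convex combination of the experts' probabilities. The gates depend on the input, so different experts dominate in different regions of feature space, which is how a mixture of simple experts can represent a non-linear decision boundary.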