Using biased discriminant analysis for email filtering

Authors:
Juan Carlos Gomez; Marie-Francine Moens
Affiliations:
ITESM, Monterrey, NL, Mexico; Katholieke Universiteit Leuven, Heverlee, Belgium
Venue:
KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part I
Year:
2010

Abstract

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method, which relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), aims at finding a feature space transformation that closely clusters positive examples while pushing away the negative ones. The method is an extension of Linear Discriminant Analysis (LDA) that introduces a different transformation to improve the separation between classes; until now it has not been applied to text mining tasks. We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four standard ground-truth corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second schema, we test the anticipatory properties of the statistical features with the TREC 2007 spam corpus. The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.
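
The abstract does not give the BDA equations, so the following is only a minimal sketch of the idea it describes (cluster the positive examples, push the negatives away), not the authors' implementation. It assumes the commonly cited formulation in which both scatter matrices are measured around the positive-class mean and the projection comes from a generalized eigenproblem. The function names `bda_fit` and `bda_transform`, the ridge parameter `reg`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def bda_fit(X_pos, X_neg, n_components=2, reg=1e-3):
    """Sketch of a BDA-style projection (assumed formulation, see lead-in).

    Finds directions w that maximize the scatter of negative examples
    around the positive mean relative to the scatter of the positives.
    """
    mu_pos = X_pos.mean(axis=0)
    d_pos = X_pos - mu_pos          # positives, centered on their own mean
    d_neg = X_neg - mu_pos          # negatives, centered on the POSITIVE mean
    s_pos = d_pos.T @ d_pos / len(X_pos)
    s_neg = d_neg.T @ d_neg / len(X_neg)

    # A small ridge term (an assumption) keeps s_pos positive definite.
    s_pos_reg = s_pos + reg * np.eye(s_pos.shape[0])

    # Solve s_neg w = lambda * s_pos_reg w by Cholesky whitening:
    # with s_pos_reg = L L^T and u = L^T w, the problem becomes the
    # ordinary symmetric eigenproblem (L^-1 s_neg L^-T) u = lambda u.
    chol = np.linalg.cholesky(s_pos_reg)
    chol_inv = np.linalg.inv(chol)
    whitened = chol_inv @ s_neg @ chol_inv.T
    _, vecs = np.linalg.eigh(whitened)      # eigenvalues in ascending order

    top = vecs[:, ::-1][:, :n_components]   # largest eigenvalues first
    W = chol_inv.T @ top                    # map back: w = L^-T u
    return mu_pos, W


def bda_transform(X, mu_pos, W):
    """Project feature vectors into the reduced BDA-style space."""
    return (X - mu_pos) @ W


if __name__ == "__main__":
    # Synthetic stand-in for document feature vectors (not real email data).
    rng = np.random.default_rng(0)
    X_pos = rng.normal(0.0, 0.1, size=(50, 5))
    X_neg = rng.normal(0.0, 1.0, size=(50, 5))
    mu, W = bda_fit(X_pos, X_neg, n_components=2)
    print(bda_transform(X_pos, mu, W).shape)
```

In a filtering setting, the projected vectors would then be passed to an ordinary classifier; which class plays the "positive" role (spam or legitimate mail) is a design choice that the abstract does not specify.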