PCA document reconstruction for email classification

Authors:
Juan Carlos Gomez;Marie-Francine Moens
Venue:
Computational Statistics & Data Analysis
Year:
2012


Abstract

This paper presents a document classifier based on text content features and its application to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The underlying idea is that principal component analysis (PCA) can optimally compress only the kind of documents (in our experiments, email classes) that were used to compute the principal components (PCs), and that other kinds of documents will not compress well using only a few components. The classifier therefore computes a separate PCA for each document class. When a new instance arrives, it is projected onto the set of PCs computed for each class and then reconstructed using the same PCs. The reconstruction error is computed for each class, and the classifier assigns the instance to the class with the smallest error, that is, the smallest divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g., spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the different validation datasets employed, reaching better performance than the popular Support Vector Machine classifier.
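
The procedure described in the abstract (one PCA per class, project, reconstruct, pick the smallest error) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the squared Euclidean error, the number of components, and the synthetic four-dimensional "documents" are all assumptions made for illustration; the abstract does not specify the feature weighting, the error measure, or how many PCs are kept.

```python
import numpy as np

def fit_class_pca(X, n_components):
    """Return the mean and the top principal components of one class's documents."""
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(x, mean, components):
    """Project x onto the class PCs, reconstruct it, and return the squared error.

    Squared Euclidean distance is an assumption; the abstract only says
    "reconstruction error".
    """
    centered = x - mean
    reconstructed = components.T @ (components @ centered)
    return float(np.sum((centered - reconstructed) ** 2))

def classify(x, models):
    """Assign x to the class whose PCs reconstruct it with the smallest error."""
    return min(models, key=lambda label: reconstruction_error(x, *models[label]))

# Synthetic stand-in for document vectors (hypothetical data):
# "ham" varies mostly along feature 0, "spam" mostly along feature 2.
rng = np.random.default_rng(0)
ham = rng.normal(size=(50, 1)) * np.array([1.0, 0.0, 0.0, 0.0])
ham = ham + 0.05 * rng.normal(size=(50, 4))
spam = rng.normal(size=(50, 1)) * np.array([0.0, 0.0, 1.0, 0.0])
spam = spam + 0.05 * rng.normal(size=(50, 4))

# One PCA model per class, keeping a single component (an arbitrary choice).
models = {
    "ham": fit_class_pca(ham, n_components=1),
    "spam": fit_class_pca(spam, n_components=1),
}

print(classify(np.array([3.0, 0.0, 0.0, 0.0]), models))
print(classify(np.array([0.0, 0.0, -2.5, 0.0]), models))
```

In this toy setup, a vector lying along the direction that dominates one class is reconstructed almost exactly by that class's component and poorly by the other's, which is the effect the classifier relies on. With real emails, the rows of `X` would be term-based feature vectors and the number of components would need to be tuned.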