Learning to classify e-mail

Authors:
Irena Koprinska; Josiah Poon; James Clark; Jason Chan
Affiliations:
School of Information Technologies, The University of Sydney, Sydney, Australia (all authors)
Venue:
Information Sciences: an International Journal
Year:
2007

Abstract

In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. First, in a supervised learning setting, we investigate the use of random forests for both tasks. We show that a random forest is a good choice for these tasks: it runs fast on large, high-dimensional databases, is easy to tune, and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naive Bayes. We also introduce a new, accurate feature selector with linear time complexity. Second, we examine the applicability of the semi-supervised co-training paradigm to spam e-mail filtering, employing random forests, support vector machines, decision trees and naive Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted using unlabelled examples, reaching an accuracy only 5% lower than that of a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that, in the domain of spam e-mail filtering, it can be as competitive as co-training with two natural feature splits.
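The co-training idea mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the two views are taken to be subject words and body words, the base learner is a tiny naive Bayes written from scratch, the toy messages are invented, and the names `NaiveBayes`, `co_train`, `predict`, `SEED` and `UNLABELLED` are hypothetical. In each round, the classifier for each view labels the unlabelled example it is most confident about for every class and adds it to the shared labelled set.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over bags of words (Laplace smoothing)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.n = len(labels)
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict_proba(self, words):
        # Work in log space, then normalise to posterior probabilities.
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / self.n)
            for w in words:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.counts[c][w] + 1) / total)
            scores[c] = score
        top = max(scores.values())
        unnorm = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(unnorm.values())
        return {c: v / z for c, v in unnorm.items()}


def fit_view(labelled, view):
    """Train a classifier on a single view of the labelled examples."""
    return NaiveBayes().fit([x[view] for x, _ in labelled],
                            [y for _, y in labelled])


def co_train(seed, unlabelled, rounds=2):
    """Simplified co-training: per round, each view's classifier moves its
    most confident unlabelled example of every class into the labelled set."""
    labelled, pool = list(seed), list(unlabelled)
    for _ in range(rounds):
        if not pool:
            break
        for view in (0, 1):
            clf = fit_view(labelled, view)
            predictions = [(clf.predict_proba(x[view]), x) for x in pool]
            for c in clf.classes:
                candidates = [(p[c], x) for p, x in predictions
                              if max(p, key=p.get) == c]
                if candidates:
                    _, best = max(candidates, key=lambda t: t[0])
                    labelled.append((best, c))
                    pool.remove(best)
    return [fit_view(labelled, v) for v in (0, 1)], labelled


def predict(classifiers, example):
    """Combine the two views by multiplying their class posteriors."""
    p0 = classifiers[0].predict_proba(example[0])
    p1 = classifiers[1].predict_proba(example[1])
    combined = {c: p0[c] * p1[c] for c in p0}
    return max(combined, key=combined.get)


# Invented toy data: each example is (subject_words, body_words).
SEED = [
    ((["cheap", "pills"], ["buy", "now", "discount"]), "spam"),
    ((["meeting", "agenda"], ["see", "attached", "minutes"]), "ham"),
]
UNLABELLED = [
    (["cheap", "offer"], ["discount", "buy", "today"]),
    (["project", "meeting"], ["minutes", "attached"]),
    (["pills", "offer"], ["buy", "discount"]),
    (["agenda", "update"], ["see", "minutes"]),
]

classifiers, labelled = co_train(SEED, UNLABELLED)
print(predict(classifiers, (["offer"], ["today"])))
```

The query in the last line uses only words that never occur in the two seed messages, so any signal the classifiers have about them comes from the examples labelled during co-training. A fuller version would hold out a test set and compare against a classifier trained on all labels, which is the comparison the abstract reports.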