Using phrases as features in email classification

Authors:
Matthew Chang;Chung Keung Poon
Affiliations:
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, China;Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong, China
Venue:
Journal of Systems and Software
Year:
2009


Abstract

In this paper, we report our experience with using phrases as basic features in the email classification problem. We performed an extensive empirical evaluation on our large email collections, testing three text classification algorithms: a naive Bayes classifier and two k-NN classifiers, one using TF-IDF weighting and the other using resemblance. The investigation covers the effect of phrase size, the size of local and global sampling, the neighbourhood size, and various methods for improving classification accuracy. We determined suitable settings for the parameters of each classifier and compared the classifiers at their best settings. Our results show that no classifier dominates the others in terms of classification accuracy. We also made a number of observations on the special characteristics of emails. In particular, we observed that public emails are easier to classify than private ones.
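As a rough illustration of the resemblance-based k-NN idea mentioned in the abstract, the following minimal sketch treats a phrase as a run of consecutive words and measures resemblance as the overlap between two phrase sets (intersection size divided by union size). The abstract does not give the paper's exact definitions, so the function names `phrases`, `resemblance`, and `knn_classify`, the whitespace tokenization, the majority vote, and the toy training data are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def phrases(text, size=2):
    """Return the set of word n-grams ("phrases") of the given size.

    Assumption: lowercasing and whitespace tokenization stand in for
    whatever preprocessing the paper actually uses.
    """
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def resemblance(a, b):
    """Overlap of two phrase sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_classify(text, labelled, k=3, size=2):
    """Label `text` by majority vote among its k most resembling neighbours.

    `labelled` is a list of (email_text, folder_label) pairs (hypothetical data).
    """
    query = phrases(text, size)
    ranked = sorted(
        labelled,
        key=lambda item: resemblance(query, phrases(item[0], size)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented purely for this example.
training = [
    ("project meeting moved to friday afternoon", "work"),
    ("please review the project meeting agenda", "work"),
    ("family dinner on sunday evening", "personal"),
    ("see you at the family dinner", "personal"),
]

predicted = knn_classify("the project meeting is on monday", training, k=3)
```

In this sketch, `size` plays the role of the phrase size and `k` the neighbourhood size, two of the parameters the abstract says were studied. The sampling of phrases, which the abstract also mentions, is not modelled here.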