Automatic indexing based on Bayesian inference networks
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Information extraction as a basis for high-precision text classification
ACM Transactions on Information Systems (TOIS)
Inductive learning algorithms and representations for text categorization
Proceedings of the seventh international conference on Information and knowledge management
MailCat: an intelligent assistant for organizing e-mail
Proceedings of the third annual conference on Autonomous Agents
Text databases & document management
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
Machine Learning
Machine Learning
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists
Information Retrieval
Feature Subset Selection in Text-Learning
ECML '98 Proceedings of the 10th European Conference on Machine Learning
A Comparative Study on Feature Selection in Text Categorization
ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Challenges of the Email Domain for Text Classification
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Feature Reduction for Neural Network Based Text Categorization
DASFAA '99 Proceedings of the Sixth International Conference on Database Systems for Advanced Applications
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Detecting phrase-level duplication on the world wide web
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
ECIR'03 Proceedings of the 25th European conference on IR research
Grindstone4Spam: An optimization toolkit for boosting e-mail classification
Journal of Systems and Software
Hi-index | 0.00 |
In this paper, we report our experience on the use of phrases as basic features in the email classification problem. We performed extensive empirical evaluation using our large email collections and tested with three text classification algorithms, namely, a naive Bayes classifier and two k-NN classifiers using TF-IDF weighting and resemblance respectively. The investigation includes studies on the effect of phrase size, the size of local and global sampling, the neighbourhood size, and various methods to improve the classification accuracy. We determined suitable settings for various parameters of the classifiers and performed a comparison among the classifiers with their best settings. Our result shows that no classifier dominates the others in terms of classification accuracy. Also, we made a number of observations on the special characteristics of emails. In particular, we observed that public emails are easier to classify than private ones.