E-mail foldering, i.e. classifying e-mails into user-defined folders, can be viewed as a text classification (categorization) problem. However, it has some intrinsic properties that make it harder to handle: the large cardinality of the class variable (i.e. the number of folders), the uneven number of e-mails per class, and the fact that the problem is dynamic, in the sense that e-mails arrive in the mail folders along a timeline. Perhaps because of these difficulties, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes (folders) as the main problem and propose a new method based on learning and sampling probability distributions. Our experiments on a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For completeness, our experimental study also compares this approach with another standard balancing method (SMOTE) and with other classifiers.
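
The abstract does not spell out the balancing algorithm, so the following is only a minimal sketch of one way "learning and sampling probability distributions" could be read, not the method from the paper. It assumes that each folder is modelled by a unigram (multinomial) term distribution plus an empirical document-length distribution, and that synthetic e-mails are drawn from these until every folder matches the largest one. The function name `balance_by_sampling` and its parameters are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_sampling(docs, labels, seed=0):
    """Oversample minority classes with synthetic documents.

    docs: list of token lists; labels: list of class labels.
    Assumption (not from the paper): each class is modelled by a unigram
    term distribution and by the empirical lengths of its documents.
    """
    rng = random.Random(seed)

    # Group the original documents by class (folder).
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)

    # Every class is topped up to the size of the largest class.
    target = max(len(class_docs) for class_docs in by_class.values())

    new_docs, new_labels = list(docs), list(labels)
    for label, class_docs in by_class.items():
        # "Learn" the class distribution: raw term counts act as weights.
        term_counts = Counter(tok for doc in class_docs for tok in doc)
        if not term_counts:
            continue  # nothing to sample from for an all-empty class
        terms = list(term_counts)
        weights = [term_counts[t] for t in terms]
        lengths = [len(doc) for doc in class_docs if doc]

        # "Sample" synthetic documents until the class reaches the target.
        for _ in range(target - len(class_docs)):
            n_tokens = rng.choice(lengths)
            new_docs.append(rng.choices(terms, weights=weights, k=n_tokens))
            new_labels.append(label)

    return new_docs, new_labels
```

The balanced output would then be vectorized and passed to a multinomial Naive Bayes learner. To respect the timeline aspect mentioned above, balancing should presumably be applied to the training portion only, never to the e-mails held out for evaluation.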