Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets

Authors:
Pablo Bermejo;Jose A. Gámez;Jose M. Puerta
Affiliations:
Intelligent Systems and Data Mining Group, Computing Systems Department (I3A), Universidad de Castilla-La Mancha, Albacete, Spain;Intelligent Systems and Data Mining Group, Computing Systems Department (I3A), Universidad de Castilla-La Mancha, Albacete, Spain;Intelligent Systems and Data Mining Group, Computing Systems Department (I3A), Universidad de Castilla-La Mancha, Albacete, Spain
Venue:
Expert Systems with Applications: An International Journal
Year:
2011

Abstract

E-mail foldering, the classification of e-mails into user-defined folders, can be viewed as a text classification (categorization) problem. However, it has some intrinsic properties that make it harder to handle: the large cardinality of the class variable (i.e. the number of folders), the uneven number of e-mails per class, and the fact that the problem is dynamic, in the sense that e-mails arrive in the folders along a timeline. Perhaps because of these properties, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes (folders) as the main problem and propose a new balancing method based on learning and sampling probability distributions. Our experiments over a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For completeness, our experimental study also compares this approach with another standard balancing method (SMOTE) and with other classifiers.
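The abstract does not spell out the balancing algorithm, so the following is only a minimal sketch of one way to balance folders by learning and sampling probability distributions; it should not be read as the authors' method. It assumes bag-of-words e-mails, a Laplace-smoothed per-folder term distribution, document lengths drawn from the lengths observed in that folder, and oversampling up to the size of the largest folder. The function name `balance_by_sampling` and all of its parameters are hypothetical.

```python
import random
from collections import Counter


def balance_by_sampling(docs, labels, seed=0):
    """Oversample small folders with synthetic bag-of-words e-mails.

    docs:   list of dicts mapping term -> count (one dict per e-mail)
    labels: list of folder names, aligned with docs
    Returns the original e-mails followed by the synthetic ones.
    """
    rng = random.Random(seed)
    vocab = sorted({term for doc in docs for term in doc})

    # Group e-mails by folder.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    # Assumption: every folder is grown to the size of the largest one.
    target = max(len(group) for group in by_class.values())

    new_docs, new_labels = list(docs), list(labels)
    for label, group in by_class.items():
        # Learn a Laplace-smoothed term distribution for this folder.
        totals = Counter()
        for doc in group:
            totals.update(doc)
        weights = [totals[term] + 1 for term in vocab]

        # Assumption: synthetic lengths are drawn from observed lengths.
        lengths = [sum(doc.values()) for doc in group]

        for _ in range(target - len(group)):
            n_terms = rng.choice(lengths)
            sampled = rng.choices(vocab, weights=weights, k=n_terms)
            new_docs.append(dict(Counter(sampled)))
            new_labels.append(label)
    return new_docs, new_labels


# Toy example: three "work" e-mails and one "finance" e-mail.
docs = [
    {"meeting": 2, "agenda": 1},
    {"meeting": 1, "notes": 1},
    {"meeting": 3},
    {"budget": 2, "invoice": 1},
]
labels = ["work", "work", "work", "finance"]
balanced_docs, balanced_labels = balance_by_sampling(docs, labels, seed=1)
print(Counter(balanced_labels))  # both folders now hold 3 e-mails
```

The balanced term-count dictionaries could then be passed to any multinomial Naive Bayes implementation. In a setting where e-mails arrive over time, as the abstract describes, balancing would presumably be applied to the training portion only.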