Improving the performance of Naive Bayes multinomial in e-mail foldering by introducing distribution-based balance of datasets

  • Authors:
  • Pablo Bermejo;Jose A. Gámez;Jose M. Puerta

  • Affiliations:
  • Intelligent Systems and Data Mining Group, Computing Systems Department (I3A), Universidad de Castilla-La Mancha, Albacete, Spain (all authors)

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Quantified Score

Hi-index 12.06

Abstract

E-mail foldering, i.e., the classification of e-mail into user-predefined folders, can be viewed as a text classification/categorization problem. However, it has some intrinsic properties that make it more difficult to deal with, mainly the large cardinality of the class variable (i.e., the number of folders), the uneven number of e-mails per class, and the fact that the problem is dynamic, in the sense that e-mails arrive in mail folders over time. Perhaps because of these properties, standard text-oriented classifiers such as Naive Bayes Multinomial do not obtain good accuracy when applied to e-mail corpora. In this paper, we identify the imbalance among classes/folders as the main problem and propose a new balancing method based on learning and sampling probability distributions. Our experiments over a standard corpus (ENRON) with seven datasets (e-mail users) show that the results obtained by Naive Bayes Multinomial improve significantly when the balancing algorithm is applied first. For the sake of completeness, our experimental study also compares this approach with another standard balancing method (SMOTE) and with other classifiers.
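
The abstract does not say how the class distributions are learned or sampled, so the following is only a minimal illustrative sketch of the general idea, not the authors' algorithm. It assumes that each e-mail is a vector of term counts, that each folder is modelled by a single Laplace-smoothed multinomial over terms, and that minority folders are topped up with synthetic count vectors until every folder matches the largest one. The function name `balance_by_sampling` and its parameters `vocab_size`, `alpha`, and `seed` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_sampling(docs, labels, vocab_size, alpha=1.0, seed=0):
    """Oversample minority classes with documents drawn from a per-class
    multinomial over terms (hypothetical helper, not the paper's algorithm).

    docs   : list of term-count vectors (lists of ints, length vocab_size)
    labels : list of class labels (folders), same length as docs
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_class[label].append(doc)

    # Every class is topped up to the size of the largest class.
    target = max(len(group) for group in by_class.values())
    new_docs, new_labels = list(docs), list(labels)

    for label, group in by_class.items():
        # Laplace-smoothed term weights for this class (assumed model).
        totals = [alpha] * vocab_size
        for doc in group:
            for term, count in enumerate(doc):
                totals[term] += count
        lengths = [sum(doc) for doc in group]

        for _ in range(target - len(group)):
            # Reuse an observed document length, then draw that many terms.
            length = rng.choice(lengths)
            drawn = Counter(rng.choices(range(vocab_size), weights=totals, k=length))
            new_docs.append([drawn.get(term, 0) for term in range(vocab_size)])
            new_labels.append(label)

    return new_docs, new_labels


# Toy data: three "work" e-mails and one "personal" e-mail over a 3-term vocabulary.
docs = [[3, 0, 1], [2, 1, 0], [4, 0, 0], [0, 2, 3]]
labels = ["work", "work", "work", "personal"]
balanced_docs, balanced_labels = balance_by_sampling(docs, labels, vocab_size=3)
```

The balanced output could then be passed to any multinomial Naive Bayes implementation in place of the original training set. Sampling from a learned distribution differs from SMOTE, which interpolates between existing minority examples and their nearest neighbours instead of drawing new documents from a fitted model.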