Smoothing Multinomial Naïve Bayes in the Presence of Imbalance

  • Authors:
  • Alexander Y. Liu; Cheryl E. Martin

  • Affiliations:
  • Applied Research Laboratories, The University of Texas at Austin, Austin, Texas (both authors)

  • Venue:
  • MLDM '11: Proceedings of the 7th International Conference on Machine Learning and Data Mining in Pattern Recognition
  • Year:
  • 2011


Abstract

Multinomial naïve Bayes is a popular classifier used for a wide variety of applications. When applied to text classification, it requires some form of smoothing when estimating parameters. Typically, Laplace smoothing is used, and researchers have proposed several other successful forms of smoothing. In this paper, we show that common preprocessing techniques for text categorization have detrimental effects when combined with several of these well-known smoothing methods. We also introduce a new form of smoothing for which these detrimental effects are less severe: ROSE smoothing, which can be derived from methods for cost-sensitive learning and imbalanced datasets. We show empirically on text data that ROSE smoothing performs well compared to known smoothing methods and is the only method tested that performs well regardless of the type of text preprocessing used. It is particularly effective compared to existing methods when the data is imbalanced.
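
The abstract does not spell out the parameter-estimation step, so the following minimal sketch shows the standard Laplace (add-one) smoothing baseline the paper compares against: P(w|c) = (N_cw + α) / (N_c + α·|V|), where N_cw is the count of word w in class-c documents, N_c is the total word count in class c, and |V| is the vocabulary size. The ROSE smoothing formula itself is not given in the abstract and is not reproduced here; all function and variable names below are illustrative.

```python
import numpy as np

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate multinomial naive Bayes parameters with add-alpha smoothing.

    X     : (n_docs, n_words) array of raw term counts
    y     : (n_docs,) array of class labels
    alpha : smoothing pseudo-count (alpha=1.0 is classic Laplace smoothing)
    """
    classes = np.unique(y)
    n_words = X.shape[1]
    log_prior = np.empty(len(classes))
    log_likelihood = np.empty((len(classes), n_words))
    for i, c in enumerate(classes):
        Xc = X[y == c]
        log_prior[i] = np.log(Xc.shape[0] / X.shape[0])
        word_counts = Xc.sum(axis=0)
        # Smoothed estimate: P(w|c) = (N_cw + alpha) / (N_c + alpha * |V|);
        # without smoothing, any word unseen in class c would zero out the score.
        log_likelihood[i] = np.log((word_counts + alpha)
                                   / (word_counts.sum() + alpha * n_words))
    return classes, log_prior, log_likelihood

def predict(X, classes, log_prior, log_likelihood):
    # argmax over classes of: log P(c) + sum_w x_w * log P(w|c)
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Toy usage on four short "documents" over a 3-word vocabulary.
X = np.array([[2, 1, 0], [0, 1, 3], [1, 0, 0], [0, 2, 2]])
y = np.array([0, 1, 0, 1])
params = train_multinomial_nb(X, y)
print(predict(X, *params))  # -> [0 1 0 1]
```

The alpha pseudo-count here plays the same role as the alpha parameter of scikit-learn's MultinomialNB. The paper's argument is that this fixed smoothing choice interacts poorly with common text preprocessing and with class imbalance, which motivates the proposed ROSE smoothing.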