Document preprocessing for naive Bayes classification and clustering with mixture of multinomials

  • Authors:
  • Dmitry Pavlov; Ramnath Balasubramanyan; Byron Dom; Shyam Kapur; Jignashu Parikh

  • Affiliations:
  • Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA

  • Venue:
  • Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2004

Abstract

The naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the probabilistic mixture of multinomials model, has likewise been successfully applied to text clustering problems. Despite the strong independence assumptions these models make, their attractiveness comes from low computational cost, relatively low memory consumption, the ability to handle heterogeneous features and multiple classes, and performance that is often competitive with top-of-the-line models. Recently, there have been several attempts to alleviate the problems of naive Bayes by applying heuristic feature transformations, such as IDF weighting, normalization by document length, and taking logarithms of the counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering of collocated-term vectors from user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about the promise these transformations hold for alleviating the strong assumptions of the multinomial model.
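
The feature transformations named in the abstract (log counts, IDF weighting, document-length normalization) can be illustrated with a brief sketch. The function name, parameter choices, and order of operations below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def preprocess_counts(X, log_counts=True, idf=True, length_normalize=True):
    """Heuristic feature transformations for multinomial models.

    X: (n_docs, n_terms) array of raw term counts.
    The order log -> IDF -> length normalization is an assumption made
    for this sketch; the paper evaluates such transformations empirically.
    """
    X = X.astype(float)
    if log_counts:
        # Dampen the influence of terms repeated many times in one document.
        X = np.log1p(X)
    if idf:
        # Down-weight terms that occur in many documents.
        n_docs = X.shape[0]
        df = np.count_nonzero(X, axis=0)
        idf_weights = np.log((n_docs + 1.0) / (df + 1.0)) + 1.0
        X = X * idf_weights
    if length_normalize:
        # Remove the effect of document length (L1 normalization here).
        lengths = X.sum(axis=1, keepdims=True)
        lengths[lengths == 0] = 1.0
        X = X / lengths
    return X
```

The resulting matrix can then be fed to a multinomial naive Bayes classifier or a mixture-of-multinomials clustering procedure in place of the raw counts.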