Document preprocessing for naive Bayes classification and clustering with mixture of multinomials

  • Authors:
  • Dmitry Pavlov; Ramnath Balasubramanyan; Byron Dom; Shyam Kapur; Jignashu Parikh

  • Affiliations:
  • Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA; Yahoo Inc., Sunnyvale, CA

  • Venue:
  • Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  • Year:
  • 2004

Abstract

The naive Bayes classifier has long been used for text categorization tasks. Its sibling from the unsupervised world, the probabilistic mixture of multinomials model, has likewise been successfully applied to text clustering problems. Despite the strong independence assumptions these models make, their attractiveness comes from low computational cost, relatively low memory consumption, the ability to handle heterogeneous features and multiple classes, and performance that is often competitive with top-of-the-line models. Recently, there have been several attempts to alleviate the problems of naive Bayes by applying heuristic feature transformations, such as IDF weighting, normalization by document length, and taking logarithms of the counts. We justify the use of these techniques and apply them to two problems: classification of products in Yahoo! Shopping and clustering of collocated-term vectors from user queries to Yahoo! Search. The experimental evaluation allows us to draw conclusions about the promise these transformations hold for alleviating the strong assumptions of the multinomial model.
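
The feature transformations named in the abstract (log counts, IDF weighting, document-length normalization) can be illustrated with a brief sketch. The function name, parameter choices, and order of operations below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def preprocess_counts(X, log_counts=True, idf=True, length_normalize=True):
    """Heuristic feature transformations for multinomial models.

    X: (n_docs, n_terms) array of raw term counts.
    The order log -> IDF -> length normalization is an assumption made
    for this sketch; the paper evaluates such transformations empirically.
    """
    X = X.astype(float)
    if log_counts:
        # Dampen the influence of terms repeated many times in one document.
        X = np.log1p(X)
    if idf:
        # Down-weight terms that occur in many documents.
        n_docs = X.shape[0]
        df = np.count_nonzero(X, axis=0)
        idf_weights = np.log((n_docs + 1.0) / (df + 1.0)) + 1.0
        X = X * idf_weights
    if length_normalize:
        # Remove the effect of document length (L1 normalization here).
        lengths = X.sum(axis=1, keepdims=True)
        lengths[lengths == 0] = 1.0
        X = X / lengths
    return X
```

The resulting matrix can then be fed to a multinomial naive Bayes classifier or a mixture-of-multinomials clustering procedure in place of the raw counts.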