It is well known that occurrence counts of words in documents are often modeled poorly by standard distributions such as the binomial or Poisson. Observed counts vary more than these simple models predict, which has prompted the use of overdispersed models such as Gamma-Poisson or Beta-binomial mixtures as robust alternatives. A second deficiency of standard models is that most words never occur in a given document, producing a large number of zero counts. We propose zero-inflated models to deal with this, and evaluate competing models on a Naive Bayes text classification task. Simple zero-inflated models can account for practically relevant variation, and can be easier to work with than overdispersed models.
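The abstract is terse about the model itself, so here is a minimal sketch of what a zero-inflated Poisson (ZIP) Naive Bayes classifier could look like. Everything below is an illustrative assumption, not the paper's implementation: the names `fit_zip` and `ZIPNaiveBayes` are invented, lambda is crudely estimated from the positive counts (ignoring truncation bias), and pi is then set so that the model's zero probability matches the empirical zero fraction; a full treatment would fit both parameters by maximum likelihood, e.g. via EM.

```python
import numpy as np
from math import exp, lgamma, log

def fit_zip(counts, eps=1e-9):
    """Crude zero-inflated Poisson fit for one word's counts across documents.

    Illustrative only: lam is the mean of the positive counts (truncation
    bias ignored), and pi is chosen so the model's zero probability matches
    the empirical zero fraction. Proper ML fitting would use EM.
    """
    counts = np.asarray(counts, dtype=float)
    p0 = np.mean(counts == 0)                     # empirical fraction of zeros
    positives = counts[counts > 0]
    lam = positives.mean() if positives.size else eps
    # Solve p0 = pi + (1 - pi) * exp(-lam) for pi, then clip to [0, 1).
    pi = (p0 - np.exp(-lam)) / (1.0 - np.exp(-lam) + eps)
    return float(np.clip(pi, 0.0, 1.0 - eps)), float(max(lam, eps))

def zip_logpmf(k, pi, lam):
    """Log probability of count k under a zero-inflated Poisson:
    P(0) = pi + (1 - pi) * exp(-lam);  P(k >= 1) = (1 - pi) * Pois(k; lam)."""
    if k == 0:
        return log(pi + (1.0 - pi) * exp(-lam))
    return log(1.0 - pi) + k * log(lam) - lam - lgamma(k + 1)

class ZIPNaiveBayes:
    """Naive Bayes with an independent ZIP model per (class, word) pair."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_, self.pi_, self.lam_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.log_prior_[c] = log(len(Xc) / len(X))
            params = [fit_zip(Xc[:, j]) for j in range(X.shape[1])]
            self.pi_[c] = np.array([p for p, _ in params])
            self.lam_[c] = np.array([l for _, l in params])
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            # Log-posterior up to a constant: log prior + sum of word log-likelihoods.
            scores = {
                c: self.log_prior_[c]
                   + sum(zip_logpmf(int(k), pi, lam)
                         for k, pi, lam in zip(x, self.pi_[c], self.lam_[c]))
                for c in self.classes_
            }
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```

A toy run on a four-document, three-word count matrix:

```python
X = np.array([[0, 2, 0], [0, 3, 1], [4, 0, 0], [5, 0, 0]])
y = np.array([0, 0, 1, 1])
print(ZIPNaiveBayes().fit(X, y).predict([[0, 1, 0], [3, 0, 0]]))  # -> [0 1]
```

One appeal of the zero-inflated formulation, in line with the abstract's claim that such models can be easier to work with, is that the extra zero mass can be matched in closed form once lambda is estimated, whereas fitting an overdispersed Gamma-Poisson (negative binomial) model requires iterative optimization.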