Parametric models of linguistic count data

Authors:
Martin Jansche
Affiliations:
The Ohio State University, Columbus, OH
Venue:
ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1
Year:
2003

Citing 3

Approximate statistical tests for comparing supervised classification learning algorithms
Neural Computation

Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Empirical estimates of adaptation: the chance of two Noriegas is closer to p/2 than p^2
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Cited 5

Modeling word burstiness using the Dirichlet distribution
ICML '05 Proceedings of the 22nd international conference on Machine learning

Sentiment Detection Using Lexically-Based Classifiers
TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue

Feature selection with a measure of deviations from Poisson in text categorization
Expert Systems with Applications: An International Journal

An improved hierarchical Bayesian model of language for document classification
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1

Vocabulary choice as an indicator of perspective
ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers

Abstract

It is well known that occurrence counts of words in documents are often modeled poorly by standard distributions like the binomial or Poisson. Observed counts vary more than simple models predict, prompting the use of overdispersed models like Gamma-Poisson or Beta-binomial mixtures as robust alternatives. A second deficiency of standard models is that most words never occur in a given document, which produces a large number of zero counts. We propose zero-inflated models to deal with this, and evaluate competing models on a Naive Bayes text classification task. Simple zero-inflated models can account for practically relevant variation, and can be easier to work with than overdispersed models.
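
To make the idea of zero inflation concrete, here is a minimal Python sketch of a zero-inflated Poisson distribution in its common textbook form. It is an illustration only: the abstract does not give the paper's exact parameterization, and the names `poisson_pmf`, `zip_pmf`, `zip_mean_var`, `lam` and `pi` are assumptions introduced for this example.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson (textbook form, assumed here for illustration).

    With probability pi the count is a structural zero (the word simply
    does not occur in the document); otherwise it is drawn from Poisson(lam).
    """
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def zip_mean_var(lam, pi):
    """Closed-form mean and variance of the zero-inflated Poisson above."""
    mean = (1.0 - pi) * lam
    var = mean * (1.0 + pi * lam)
    return mean, var


# Hypothetical parameter values, chosen only for the demonstration.
lam, pi = 2.0, 0.5
mean, var = zip_mean_var(lam, pi)
print("P(0) plain Poisson:", poisson_pmf(0, lam))
print("P(0) zero-inflated:", zip_pmf(0, lam, pi))
print("mean:", mean, "variance:", var)
```

Under these assumed values the zero-inflated model puts more mass on a count of zero than the plain Poisson with the same rate, and its variance exceeds its mean whenever `pi` and `lam` are both positive. A single extra parameter therefore captures both the surplus of zeros and some of the extra variation that the abstract describes.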