Concentration Bounds for Unigram Language Models

  • Authors: Evgeny Drukh; Yishay Mansour
  • Venue: The Journal of Machine Learning Research
  • Year: 2005

Abstract

We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately O(k / m^{1/2}). We improve its dependency on k to O(k^{1/4} / m^{1/2} + k / m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O(1 / k + k^{1/2} / m). We derive a combined estimator, which has an error of approximately O(m^{-2/5}), for any k.

A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1 / m^{1/2}), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.
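
To make the estimated quantities concrete, here is a minimal sketch in Python (not the paper's code): it computes the Good-Turing estimate (k + 1) * n_{k+1} / m and the empirical estimate k * n_k / m of the total probability of all words appearing exactly k times in a sample of size m, plus a leave-one-out estimate of the per-word log-loss. The toy Zipf-like distribution, the vocabulary names, and the add-one smoothing used inside the leave-one-out estimate are illustrative assumptions, not taken from the paper.

    from collections import Counter
    import math
    import random


    def good_turing_mass(sample, k):
        """Good-Turing estimate of the total probability of words appearing
        exactly k times: (k + 1) * n_{k+1} / m, where n_{k+1} is the number
        of distinct words seen exactly k + 1 times in the sample."""
        m = len(sample)
        counts = Counter(sample)
        n_k_plus_1 = sum(1 for c in counts.values() if c == k + 1)
        return (k + 1) * n_k_plus_1 / m


    def empirical_mass(sample, k):
        """Empirical-frequencies estimate of the same quantity: k * n_k / m."""
        m = len(sample)
        counts = Counter(sample)
        n_k = sum(1 for c in counts.values() if c == k)
        return k * n_k / m


    def leave_one_out_log_loss(sample):
        """Leave-one-out estimate of the per-word log-loss: each word is scored
        by a unigram model built from the remaining m - 1 words, using add-one
        smoothing over the observed vocabulary (an illustrative choice here,
        not the smoothing analyzed in the paper)."""
        m = len(sample)
        counts = Counter(sample)
        vocab = len(counts)
        total = 0.0
        for w in sample:
            c = counts[w] - 1  # frequency of w with this occurrence removed
            p = (c + 1) / (m - 1 + vocab)
            total -= math.log(p)
        return total / m


    if __name__ == "__main__":
        random.seed(0)
        # Toy Zipf-like source distribution over 200 word types (illustrative only).
        words = [f"w{i}" for i in range(1, 201)]
        weights = [1.0 / i for i in range(1, 201)]
        sample = random.choices(words, weights=weights, k=5000)

        k = 3
        print("Good-Turing mass estimate, k=3:", good_turing_mass(sample, k))
        print("Empirical mass estimate, k=3:  ", empirical_mass(sample, k))
        print("Leave-one-out log-loss:        ", leave_one_out_log_loss(sample))

The paper's bounds compare exactly these kinds of estimates against the true mass of words appearing k times and against the model's expected per-word log-loss; the sketch only shows how the estimators themselves are computed from a sample.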