Enhanced Good-Turing and Cat-Cal: two new methods for estimating probabilities of English bigrams

  • Authors:
  • Kenneth W. Church; William A. Gale

  • Affiliations:
  • AT&T Bell Laboratories; AT&T Bell Laboratories

  • Venue:
  • HLT '89: Proceedings of the Workshop on Speech and Natural Language
  • Year:
  • 1989

Abstract

For many pattern recognition applications, including speech recognition and optical character recognition, prior models of language are used to disambiguate otherwise equally probable outputs. It is common practice to use tables of probabilities of single words, pairs of words, and triples of words (n-grams) as a prior model. Our research is directed to 'backing-off' methods, that is, methods that build an (n+1)-gram model from an n-gram model.

In principle, n-gram probabilities can be estimated from a large sample of text by counting the number of occurrences of each n-gram of interest and dividing by the size of the training sample. Unfortunately, this simple method, known as the "maximum likelihood estimator" (MLE), is unsuitable because n-grams that do not occur in the training text are assigned zero probability. In addition, the MLE does not distinguish among bigrams with the same frequency.

We study two alternative methods for estimating the frequency of a given bigram in a test corpus, given a training corpus. The first method is an enhanced version of the method due to Good and Turing (Good, 1953). Under the modest assumption that the distribution of each bigram is binomial, Good provided a theoretical result that increases estimation accuracy. The second method assumes even less, merely that the training and test corpora are generated by the same process. We refer to this purely empirical method as the Categorize-Calibrate (or Cat-Cal) method.

We emphasize three points about these methods. First, by using a second predictor of the probability in addition to the observed frequency, it is possible to estimate different probabilities for bigrams with the same frequency. We refer to this use of a second predictor as "enhancement." With enhancement, we find 1200 significantly different probabilities (with a range of five orders of magnitude) for the group of bigrams not observed in the training text; the MLE method would not be able to distinguish any one of these bigrams from any other. Second, both methods provide (estimated) variances for the errors in estimating the n-gram probabilities. Third, the variances are used in a refined testing method that enables us to study small differences between methods. We find that Cat-Cal should be used when counts are very small; otherwise, GT is the method of choice.
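To make the contrast between the MLE and Good-Turing concrete, the following is a minimal sketch, not the paper's enhanced method: it computes the MLE for bigrams and the classic (unenhanced) Good-Turing adjusted count r* = (r+1) · N_{r+1} / N_r, where N_r is the number of distinct bigrams seen exactly r times. The function names, the toy corpus, and the fallback for large counts are illustrative assumptions; the paper's enhancement (a second predictor) and the Cat-Cal method are not shown here.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count bigrams in a token sequence (hypothetical helper)."""
    return Counter(zip(tokens, tokens[1:]))

def mle_probability(bigram, counts, total):
    """Maximum likelihood estimate: raw count / training size.
    Unseen bigrams get probability zero -- the problem noted above."""
    return counts.get(bigram, 0) / total

def good_turing_adjusted_counts(counts):
    """Basic (unenhanced) Good-Turing: replace each observed count r
    with r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of
    distinct bigrams observed exactly r times."""
    freq_of_freqs = Counter(counts.values())
    adjusted = {}
    for bigram, r in counts.items():
        n_r = freq_of_freqs[r]
        n_r_plus_1 = freq_of_freqs.get(r + 1, 0)
        # For the largest counts N_{r+1} is zero; fall back to the raw count.
        adjusted[bigram] = (r + 1) * n_r_plus_1 / n_r if n_r_plus_1 > 0 else float(r)
    return adjusted

if __name__ == "__main__":
    tokens = "the cat sat on the mat the cat ran".split()
    counts = bigram_counts(tokens)
    total = sum(counts.values())
    print("MLE P(the, cat) =", mle_probability(("the", "cat"), counts, total))
    print("GT-adjusted counts:", good_turing_adjusted_counts(counts))
```

Note that in this basic form all bigrams with the same raw count receive the same adjusted count; the paper's "enhancement" is precisely what lets bigrams of equal frequency receive different probability estimates.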