For many pattern recognition applications, including speech recognition and optical character recognition, prior models of language are used to disambiguate otherwise equally probable outputs. It is common practice to use tables of probabilities of single words, pairs of words, and triples of words (n-grams) as a prior model. Our research is directed at 'backing-off' methods, that is, methods that build an (n+1)-gram model from an n-gram model.

In principle, n-gram probabilities can be estimated from a large sample of text by counting the number of occurrences of each n-gram of interest and dividing by the size of the training sample. Unfortunately, this simple method, known as the "maximum likelihood estimator" (MLE), is unsuitable because n-grams that do not occur in the training text are assigned zero probability. In addition, the MLE does not distinguish among bigrams with the same frequency.

We study two alternative methods for estimating the frequency of a given bigram in a test corpus, given a training corpus. The first is an enhanced version of the method due to Good and Turing (Good, 1953). Under the modest assumption that the distribution of each bigram is binomial, Good provided a theoretical result that increases estimation accuracy. The second method assumes even less, merely that the training and test corpora are generated by the same process. We refer to this purely empirical method as the Categorize-Calibrate (or Cat-Cal) method.

We emphasize three points about these methods. First, by using a second predictor of the probability in addition to the observed frequency, it is possible to estimate different probabilities for bigrams with the same frequency. We refer to this use of a second predictor as "enhancement."
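The contrast between the MLE and the classic (unenhanced) Good-Turing adjustment can be sketched as follows. This is an illustrative sketch only, not the enhanced method studied here; the function names are assumptions introduced for the example.

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """Maximum likelihood estimate: count each bigram and divide by the
    total number of bigrams. Bigrams unseen in training get probability
    zero, and all bigrams with the same count get the same estimate."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    total = len(bigrams)
    return {bg: c / total for bg, c in counts.items()}, counts

def good_turing_adjusted_counts(counts):
    """Classic Good-Turing: replace each observed count r by
    r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct
    bigrams seen exactly r times. (The paper's enhanced version further
    splits bigrams with equal r using a second predictor.)"""
    freq_of_freqs = Counter(counts.values())
    adjusted = {}
    for bg, r in counts.items():
        n_r = freq_of_freqs[r]
        n_r1 = freq_of_freqs.get(r + 1, 0)
        # In practice N_{r+1} must itself be smoothed for large r,
        # where it is often zero; here we just fall back to r.
        adjusted[bg] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return adjusted
```

Note that the classic adjustment still assigns one shared value to every bigram with the same count r; enhancement is what breaks those ties.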
With enhancement, we find 1200 significantly different probabilities (spanning five orders of magnitude) for the group of bigrams not observed in the training text; the MLE cannot distinguish any of these bigrams from one another. Second, both methods provide (estimated) variances for the errors in estimating the n-gram probabilities. Third, the variances are used in a refined testing method that enables us to study small differences between methods. We find that Cat-Cal should be used when counts are very small; otherwise, the enhanced Good-Turing method (GT) is the method of choice.
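The idea behind a variance-aware comparison can be sketched as a paired z statistic: with per-bigram estimation errors and their estimated variances for two methods, the variance of the mean difference tells us whether an observed gap is significant. This is a generic sketch under assumed independence, not the paper's actual refined test; all names are hypothetical.

```python
import math

def paired_z(err_a, var_a, err_b, var_b):
    """Form a z statistic for the mean difference in error between two
    estimators, weighting by the estimated variances of those errors.
    Assumes the per-bigram error variances are independent (a
    simplification for illustration)."""
    n = len(err_a)
    diffs = [a - b for a, b in zip(err_a, err_b)]
    mean_diff = sum(diffs) / n
    # Variance of the mean of the differences, summing the two methods'
    # estimated variances term by term.
    var_mean = sum(va + vb for va, vb in zip(var_a, var_b)) / (n * n)
    return mean_diff / math.sqrt(var_mean)
```

Small but consistent differences between methods, which a raw comparison of average errors would miss, show up as large |z| values under such a test.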