Estimating lexical priors for low-frequency morphologically ambiguous forms
Computational Linguistics
A statistical analysis of morphemes in Japanese terminology
COLING '98 Proceedings of the 17th International Conference on Computational Linguistics - Volume 1
Non-interactive OCR post-correction for giga-scale digitization projects
CICLing'08 Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing
An asymptotic model for the English hapax/vocabulary ratio
Computational Linguistics
The number of different words expected on the basis of the urn model to appear in, for example, the first half of a text is known to overestimate the observed number of different words. This paper examines the source of this overestimation bias. It is shown that the bias does not arise from sentence-bound syntactic constraints, but that it is a direct consequence of topic cohesion in discourse. The nonrandom, clustered appearance of lexically specialized words, often the key words of the text, explains the main trends in the overestimation bias both quantitatively and qualitatively. The effects of nonrandomness are so strong that they introduce an overestimation bias even in distributions of units derived from words, such as syllables and digrams. Nonrandom word usage also affects the accuracy of the Good-Turing frequency estimates, which, for the lowest frequencies, reveal a strong underestimation bias. A heuristic adjusted frequency estimate is proposed that, at least for novel-sized texts, is considerably more accurate.
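To make the two quantities in the abstract concrete, the sketch below computes (i) the urn-model expectation of the vocabulary size in a random half of a text, which can be compared with the vocabulary actually observed in the first half, and (ii) the unsmoothed Good-Turing adjusted frequencies r* = (r+1) * V_{r+1} / V_r for the lowest frequency classes. This is a minimal illustration under stated assumptions, not the paper's method: it assumes a whitespace-tokenized plain-text file (the filename novel.txt is hypothetical), uses the standard binomial approximation to the urn-model expectation, and does not reproduce the paper's own heuristic adjusted estimate, which the abstract only names.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return (word -> frequency) and the spectrum (frequency r -> number of words occurring exactly r times)."""
    freqs = Counter(tokens)
    spectrum = Counter(freqs.values())
    return freqs, spectrum

def expected_vocab_urn(freqs, n, N):
    """Urn-model expected number of different words in a random sample of n tokens
    drawn from a text of N tokens (binomial approximation)."""
    p = n / N
    return sum(1.0 - (1.0 - p) ** f for f in freqs.values())

def good_turing_adjusted(spectrum, r):
    """Unsmoothed Good-Turing adjusted frequency r* = (r+1) * V_{r+1} / V_r."""
    if spectrum.get(r, 0) == 0:
        return None
    return (r + 1) * spectrum.get(r + 1, 0) / spectrum[r]

if __name__ == "__main__":
    # Hypothetical input file; any whitespace-tokenized text works.
    tokens = open("novel.txt", encoding="utf-8").read().lower().split()
    N = len(tokens)
    freqs, spectrum = frequency_spectrum(tokens)

    # Observed vocabulary in the actual first half vs. urn-model expectation for a random half.
    half = N // 2
    observed_half = len(set(tokens[:half]))
    expected_half = expected_vocab_urn(freqs, half, N)
    print(f"V observed in first half:   {observed_half}")
    print(f"V expected under urn model: {expected_half:.1f}")  # typically larger, per the overestimation bias

    # Good-Turing adjusted frequencies for the lowest frequency classes.
    for r in range(1, 6):
        print(f"r = {r}  r* = {good_turing_adjusted(spectrum, r)}")
```

Because the first half of a real text is not a random sample (topic cohesion clusters specialized words), the printed urn-model expectation will usually exceed the observed count, which is the bias the paper analyzes; the Good-Turing values shown here are the plain estimator, not the adjusted version the paper proposes.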