Estimating lexical priors for low-frequency morphologically ambiguous forms
Computational Linguistics
A statistical analysis of morphemes in Japanese terminology
COLING '98 Proceedings of the 17th International Conference on Computational Linguistics - Volume 1
Non-interactive OCR post-correction for giga-scale digitization projects
CICLing'08 Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing
An asymptotic model for the English hapax/vocabulary ratio
Computational Linguistics
The number of different words expected on the basis of the urn model to appear in, for example, the first half of a text is known to overestimate the observed number of different words. This paper examines the source of this overestimation bias. It is shown that the bias does not arise from sentence-bound syntactic constraints, but that it is a direct consequence of topic cohesion in discourse. The nonrandom, clustered appearance of lexically specialized words, often the key words of the text, explains the main trends in the overestimation bias both quantitatively and qualitatively. The effects of nonrandomness are so strong that they introduce an overestimation bias even in distributions of units derived from words, such as syllables and digrams. Nonrandom word usage also affects the accuracy of the Good-Turing frequency estimates, which, for the lowest frequencies, reveal a strong underestimation bias. A heuristic adjusted frequency estimate is proposed that, at least for novel-sized texts, is considerably more accurate.
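To make the two quantities in the abstract concrete, the sketch below computes (i) the urn-model expectation of the vocabulary size in a random half of a text, which can be compared with the vocabulary actually observed in the first half, and (ii) the unsmoothed Good-Turing adjusted frequencies r* = (r+1) * V_{r+1} / V_r for the lowest frequency classes. This is a minimal illustration under stated assumptions, not the paper's method: it assumes a whitespace-tokenized plain-text file (the filename novel.txt is hypothetical), uses the standard binomial approximation to the urn-model expectation, and does not reproduce the paper's own heuristic adjusted estimate, which the abstract only names.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Return (word -> frequency) and the spectrum (frequency r -> number of words occurring exactly r times)."""
    freqs = Counter(tokens)
    spectrum = Counter(freqs.values())
    return freqs, spectrum

def expected_vocab_urn(freqs, n, N):
    """Urn-model expected number of different words in a random sample of n tokens
    drawn from a text of N tokens (binomial approximation)."""
    p = n / N
    return sum(1.0 - (1.0 - p) ** f for f in freqs.values())

def good_turing_adjusted(spectrum, r):
    """Unsmoothed Good-Turing adjusted frequency r* = (r+1) * V_{r+1} / V_r."""
    if spectrum.get(r, 0) == 0:
        return None
    return (r + 1) * spectrum.get(r + 1, 0) / spectrum[r]

if __name__ == "__main__":
    # Hypothetical input file; any whitespace-tokenized text works.
    tokens = open("novel.txt", encoding="utf-8").read().lower().split()
    N = len(tokens)
    freqs, spectrum = frequency_spectrum(tokens)

    # Observed vocabulary in the actual first half vs. urn-model expectation for a random half.
    half = N // 2
    observed_half = len(set(tokens[:half]))
    expected_half = expected_vocab_urn(freqs, half, N)
    print(f"V observed in first half:   {observed_half}")
    print(f"V expected under urn model: {expected_half:.1f}")  # typically larger, per the overestimation bias

    # Good-Turing adjusted frequencies for the lowest frequency classes.
    for r in range(1, 6):
        print(f"r = {r}  r* = {good_turing_adjusted(spectrum, r)}")
```

Because the first half of a real text is not a random sample (topic cohesion clusters specialized words), the printed urn-model expectation will usually exceed the observed count, which is the bias the paper analyzes; the Good-Turing values shown here are the plain estimator, not the adjusted version the paper proposes.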