The effects of lexical specialization on the growth curve of the vocabulary

  • Authors: R. Harald Baayen
  • Affiliations: Max Planck Institute for Psycholinguistics
  • Venue: Computational Linguistics
  • Year: 1996

Abstract

The number of different words expected on the basis of the urn model to appear in, for example, the first half of a text is known to overestimate the observed number of different words. This paper examines the source of this overestimation bias. It is shown that this bias does not arise from sentence-bound syntactic constraints, but that it is a direct consequence of topic cohesion in discourse. The nonrandom, clustered appearance of lexically specialized words, often the key words of the text, explains the main trends in the overestimation bias both quantitatively and qualitatively. The effects of nonrandomness are so strong that they introduce an overestimation bias in distributions of units derived from words, such as syllables and digrams. Nonrandom word usage also affects the accuracy of the Good-Turing frequency estimates which, for the lowest frequencies, reveal a strong underestimation bias. A heuristic adjusted frequency estimate is proposed that, at least for novel-sized texts, is considerably more accurate.
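
To make the overestimation bias concrete, the following is a minimal sketch, not the paper's own procedure. It assumes one common form of the urn model, the binomial approximation, in which a word occurring f times in a text of N tokens is absent from the first M tokens with probability (1 - M/N)^f. The function names and the toy text are hypothetical and chosen only for illustration.

```python
from collections import Counter


def expected_vocab_size(tokens, m):
    """Expected number of distinct words among the first m tokens,
    assuming words are scattered at random through the text
    (urn model, binomial approximation; an assumption of this sketch)."""
    n = len(tokens)
    if not 0 <= m <= n:
        raise ValueError("m must lie between 0 and the text length")
    p_absent_once = 1 - m / n  # chance that one given token falls outside the first m
    freqs = Counter(tokens)
    # A word with frequency f is seen unless all f of its tokens fall outside.
    return sum(1 - p_absent_once ** f for f in freqs.values())


def observed_vocab_size(tokens, m):
    """Number of distinct words actually found in the first m tokens."""
    return len(set(tokens[:m]))


# Toy text: the topical words "whale" and "ship" are clustered in separate halves.
tokens = ["the", "whale"] * 4 + ["the", "ship"] * 4
half = len(tokens) // 2

expected = expected_vocab_size(tokens, half)
observed = observed_vocab_size(tokens, half)
print(f"expected: {expected:.3f}, observed: {observed}")
```

In this toy text the expectation under random placement exceeds the observed count, because "ship" is confined to the second half. That is the same direction of bias the abstract attributes to clustered, lexically specialized words.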