Nonparametric Bayesian models of lexical acquisition

  • Authors: Mark Johnson; Sharon J. Goldwater

  • Affiliations: Brown University; Brown University

  • Venue: Doctoral dissertation, Brown University

  • Year: 2007

Abstract

The child learning language is faced with a daunting task: to learn to extract meaning from an apparently meaningless stream of sound. This thesis rests on the assumption that the kinds of generalizations the learner may make are constrained by the interaction of many different types of stochastic information, including innate learning biases. I use computational modeling to investigate how the generalizations made by unsupervised learners are affected by the sources of information available to them. I adopt a Bayesian perspective, where both internal representations of language and any learning biases are made explicit. I begin by presenting a generic framework for language modeling based on nonparametric Bayesian statistics, where model complexity grows with the amount of input data. This framework divides the work of modeling between a generator, which generates lexical items, and an adaptor, which generates frequencies for those items. Separating the two tasks in this way makes the framework flexible, allowing individual components to be easily modified. Standard sampling methods, such as Gibbs or Metropolis-Hastings sampling, may be used for inference. Using this framework, I develop several specific models to investigate questions related to morphological acquisition (identifying stems and suffixes) and word segmentation (identifying word boundaries in phonemically transcribed speech). I apply these models to English corpora of newspaper text and phonemically transcribed child-directed speech. With regard to morphology, my experiments provide evidence that morphological information is learned better from word types than from word tokens. With regard to word segmentation, my results indicate that assuming independence between words (as many previous models have done) leads to undersegmentation of the data. Accounting for local context improves segmentation markedly and yields better results than previous models. I conclude by describing briefly how the models presented here can be extended in order to account for a wider range of linguistic phenomena, including phonetic variability and the relationship between morphology and syntactic class.
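The generator-and-adaptor division described in the abstract can be illustrated with a small sketch. The Python below is not code from the thesis; the names (base_generator, CRPAdaptor, alpha) and the toy stop-probability generator are illustrative assumptions. It pairs an arbitrary generator of lexical items with a Chinese Restaurant Process adaptor, the simplest of the standard nonparametric adaptors, so that previously generated items are reused in proportion to their past frequency and the number of distinct items grows with the amount of input data.

    import random
    from collections import Counter

    def base_generator(alphabet="abcdefg", stop_prob=0.3):
        """Hypothetical generator: emits a lexical item (a random string) by
        choosing characters uniformly until a stop decision. It stands in for
        any distribution over possible lexical items."""
        item = random.choice(alphabet)
        while random.random() > stop_prob:
            item += random.choice(alphabet)
        return item

    class CRPAdaptor:
        """Chinese Restaurant Process adaptor: reuses a previously generated
        item with probability proportional to its past frequency, and asks the
        base generator for a (possibly new) item with probability proportional
        to the concentration parameter alpha. The number of distinct items it
        has produced grows with the amount of data."""

        def __init__(self, generator, alpha=1.0):
            self.generator = generator
            self.alpha = alpha
            self.counts = Counter()   # token frequencies of adapted items
            self.total = 0            # total number of tokens emitted so far

        def sample(self):
            # With prob total/(total+alpha), reuse an old item in proportion to
            # its frequency; with prob alpha/(total+alpha), draw a fresh item
            # from the base generator.
            if self.total > 0 and random.random() < self.total / (self.total + self.alpha):
                items, freqs = zip(*self.counts.items())
                item = random.choices(items, weights=freqs)[0]
            else:
                item = self.generator()
            self.counts[item] += 1
            self.total += 1
            return item

    if __name__ == "__main__":
        random.seed(0)
        adaptor = CRPAdaptor(base_generator, alpha=1.0)
        tokens = [adaptor.sample() for _ in range(1000)]
        # A few items come to dominate, giving the skewed token-frequency
        # distributions characteristic of natural language lexicons.
        print(Counter(tokens).most_common(5))

Running the sketch shows a handful of items dominating the token stream, the "rich get richer" behavior such adaptors impose on frequencies. In the framework described above, the generator and the adaptor can each be swapped out independently, and inference over the resulting models is carried out with standard samplers such as Gibbs or Metropolis-Hastings.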