A child learning language faces a daunting task: learning to extract meaning from an apparently meaningless stream of sound. This thesis rests on the assumption that the kinds of generalizations the learner may make are constrained by the interaction of many different types of stochastic information, including innate learning biases. I use computational modeling to investigate how the generalizations made by unsupervised learners are affected by the sources of information available to them. I adopt a Bayesian perspective, in which both internal representations of language and any learning biases are made explicit.

I begin by presenting a generic framework for language modeling based on nonparametric Bayesian statistics, in which model complexity grows with the amount of input data. This framework divides the work of modeling between a generator, which generates lexical items, and an adaptor, which generates frequencies for those items. Separating the two tasks in this way makes the framework flexible, allowing individual components to be easily modified. Standard sampling methods, such as Gibbs or Metropolis-Hastings sampling, can be used for inference.

Using this framework, I develop several specific models to investigate questions related to morphological acquisition (identifying stems and suffixes) and word segmentation (identifying word boundaries in phonemically transcribed speech). I apply these models to English corpora of newspaper text and phonemically transcribed child-directed speech. With regard to morphology, my experiments provide evidence that morphological information is learned better from word types than from word tokens. With regard to word segmentation, my results indicate that assuming independence between words (as many previous models have done) leads to undersegmentation of the data; accounting for local context improves segmentation markedly and yields better results than previous models.

I conclude by briefly describing how the models presented here can be extended to account for a wider range of linguistic phenomena, including phonetic variability and the relationship between morphology and syntactic class.
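To make the generator/adaptor division concrete, below is a minimal Python sketch of the simplest such setup: a generator that proposes novel phoneme strings, wrapped in a one-parameter Chinese restaurant process adaptor (the Dirichlet process special case of the adaptor family the abstract describes). The class names, phoneme inventory, and parameter values are illustrative assumptions, not code or settings from the thesis.

    import random
    from collections import Counter

    class UnigramGenerator:
        # Base distribution ("generator"): proposes novel lexical items as
        # phoneme strings with geometrically distributed length.
        def __init__(self, phonemes, p_stop=0.5):
            self.phonemes = phonemes
            self.p_stop = p_stop

        def sample(self):
            word = [random.choice(self.phonemes)]
            while random.random() > self.p_stop:
                word.append(random.choice(self.phonemes))
            return "".join(word)

    class CRPAdaptor:
        # Chinese restaurant process "adaptor": reuses a previously produced
        # item with probability count/(n + alpha), and falls back on the
        # generator with probability alpha/(n + alpha).
        def __init__(self, generator, alpha=1.0):
            self.generator = generator
            self.alpha = alpha
            self.counts = Counter()   # frequency of each cached item
            self.total = 0            # number of items produced so far

        def sample(self):
            if random.random() < self.alpha / (self.total + self.alpha):
                item = self.generator.sample()      # novel item from the base
            else:
                r = random.randrange(self.total)    # reuse: rich get richer
                for item, c in self.counts.items():
                    r -= c
                    if r < 0:
                        break
            self.counts[item] += 1
            self.total += 1
            return item

    # Illustrative use: the adaptor turns the generator's flat distribution
    # into one with a few frequent "words" and a long tail of rare ones.
    adaptor = CRPAdaptor(UnigramGenerator(list("abdegiknotu")), alpha=2.0)
    corpus = [adaptor.sample() for _ in range(1000)]
    print(Counter(corpus).most_common(5))

The rich-get-richer reuse step is what lets the adaptor produce the skewed, power-law-like word frequency distributions characteristic of natural language, while the generator remains a simple model of what possible lexical items look like. This is also where the flexibility claimed above comes from: swapping in a different generator (for instance, a hypothetical one that composes stems with suffixes for the morphology experiments) changes the model of lexical forms without touching the frequency model or the inference machinery.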