Testing the robustness of online word segmentation: effects of linguistic diversity and phonetic variation

Authors:
Luc Boruta;Sharon Peperkamp;Benoît Crabbé;Emmanuel Dupoux
Affiliations:
Univ. Paris Diderot, Sorbonne Paris Cité, ALPAGE, INRIA, Paris, France and LSCP-DEC, École des Hautes Études en Sciences Sociales, École Normale Supérieure, Centre Nation ...;LSCP-DEC, École des Hautes Études en Sciences Sociales, École Normale Supérieure, Centre National de la Recherche Scientifique, Paris, France;Univ. Paris Diderot, Sorbonne Paris Cité, ALPAGE, INRIA, Paris, France;LSCP-DEC, École des Hautes Études en Sciences Sociales, École Normale Supérieure, Centre National de la Recherche Scientifique, Paris, France
Venue:
CMCL '11 Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics
Year:
2011

Abstract

Models of the acquisition of word segmentation are typically evaluated using phonemically transcribed corpora. Accordingly, they implicitly assume that children know how to undo phonetic variation when they learn to extract words from speech. Moreover, whereas models of language acquisition should perform similarly across languages, evaluation is often limited to English samples. Using child-directed corpora of English, French and Japanese, we evaluate the performance of state-of-the-art statistical models given inputs where phonetic variation has not been reduced. To do so, we measure segmentation robustness across different levels of segmental variation, simulating systematic allophonic variation or errors in phoneme recognition. We show that these models are not robust to increased variation of this kind and do not generalize to typologically different languages. From the perspective of early language acquisition, the results strengthen the hypothesis that phonological knowledge is acquired in large part before the construction of a lexicon.
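
The kind of robustness test the abstract describes can be illustrated with a minimal sketch. The abstract does not specify the noise procedure or the evaluation metric, so everything below is an assumption made for illustration: `corrupt` simulates phoneme recognition errors by uniform random substitution, and `token_f_score` scores a segmentation by comparing word spans (phoneme offsets), which stays meaningful when the segmented input is noisy. Both function names and the toy phoneme symbols are hypothetical.

```python
import random


def corrupt(utterance, inventory, error_rate, rng):
    """Substitute each phoneme by a different one with probability error_rate.

    utterance: list of words, each a tuple of phoneme symbols.
    inventory: list of at least two phoneme symbols (assumed, not from the paper).
    """
    noisy = []
    for word in utterance:
        new_word = []
        for phoneme in word:
            if rng.random() < error_rate:
                phoneme = rng.choice([p for p in inventory if p != phoneme])
            new_word.append(phoneme)
        noisy.append(tuple(new_word))
    return noisy


def spans(words):
    """Map a word sequence to the set of its (start, end) phoneme offsets."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def token_f_score(gold, predicted):
    """Harmonic mean of token precision and recall over word spans."""
    gold_spans, predicted_spans = spans(gold), spans(predicted)
    correct = len(gold_spans & predicted_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    rng = random.Random(0)
    gold = [("D", "@"), ("d", "O", "g")]          # toy utterance, two words
    inventory = ["D", "@", "d", "O", "g"]
    noisy = corrupt(gold, inventory, error_rate=0.2, rng=rng)
    # A hypothetical model output that over-segments the second word.
    predicted = [("D", "@"), ("d",), ("O", "g")]
    print(noisy, token_f_score(gold, predicted))
```

In a full experiment, one would run a segmentation model on the corrupted input at several values of `error_rate` and track how the score degrades; the model itself is outside the scope of this sketch.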