Improving word segmentation by simultaneously learning phonotactics

Authors: Daniel Blanchard; Jeffrey Heinz
Affiliations: University of Delaware; University of Delaware
Venue: CoNLL '08: Proceedings of the Twelfth Conference on Computational Natural Language Learning
Year: 2008

References (10)

Combining labeled and unlabeled data with co-training. COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory
An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery. Machine Learning, special issue on natural language learning
Foundations of Statistical Natural Language Processing
A compression-based algorithm for Chinese word segmentation. Computational Linguistics
A statistical model for word discovery in transcribed speech. Computational Linguistics
Acquiring a lexicon from unsegmented speech. ACL '95: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics
Chinese text segmentation with MBDP-1: making the most of training corpora. ACL '01: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics
Speech and Language Processing (2nd Edition)
Nonparametric Bayesian models of lexical acquisition
Unsupervised word segmentation for Sesotho using Adaptor Grammars. SigMorPhon '08: Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology

Cited by (2)

Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars. NAACL '09: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Testing the robustness of online word segmentation: effects of linguistic diversity and phonetic variation. CMCL '11: Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

Abstract

The most accurate unsupervised word segmentation systems currently available (Brent, 1999; Venkataraman, 2001; Goldwater, 2007) use a simple unigram model of phonotactics. While this simplifies some of the calculations, it overlooks cues that infant language acquisition researchers have shown to be useful for segmentation (Mattys et al., 1999; Mattys and Jusczyk, 2001). Here we explore the utility of bigram and trigram phonotactic models by enhancing Brent's (1999) MBDP-1 algorithm. The results show that the improved MBDP-Phon model outperforms other unsupervised word segmentation systems (e.g., Brent, 1999; Venkataraman, 2001; Goldwater, 2007).
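To make the contrast between unigram and bigram phonotactics concrete, the following is a minimal sketch and not the MBDP-Phon implementation. It assumes each character stands for one phoneme, uses a hypothetical boundary symbol `#`, and applies add-one smoothing to the bigram model; all function names and the toy lexicon are illustrative assumptions rather than anything taken from the paper.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol (an assumption of this sketch)


def train_unigram(lexicon):
    """Relative frequency of each phoneme (and the boundary), ignoring order."""
    counts = Counter()
    for word in lexicon:
        counts.update(list(word) + [BOUNDARY])
    total = sum(counts.values())
    return {phone: c / total for phone, c in counts.items()}


def unigram_logprob(word, model):
    """Log probability of a word under the unsmoothed unigram phoneme model.

    Raises KeyError for phonemes never seen in training.
    """
    return sum(math.log(model[p]) for p in list(word) + [BOUNDARY])


def train_bigram(lexicon):
    """Return P(cur | prev) with add-one smoothing over the observed symbols."""
    pair_counts = Counter()
    context_counts = Counter()
    symbols_seen = {BOUNDARY}
    for word in lexicon:
        symbols = [BOUNDARY] + list(word) + [BOUNDARY]
        symbols_seen.update(symbols)
        for prev, cur in zip(symbols, symbols[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(symbols_seen)

    def prob(prev, cur):
        return (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)

    return prob


def bigram_logprob(word, prob):
    """Log probability of a word, conditioning each phoneme on the previous one."""
    symbols = [BOUNDARY] + list(word) + [BOUNDARY]
    return sum(math.log(prob(p, c)) for p, c in zip(symbols, symbols[1:]))


# Toy lexicon: every word begins with the cluster "st".
lexicon = ["stop", "step", "stem"]
unigram = train_unigram(lexicon)
bigram = train_bigram(lexicon)

# "tsop" is an anagram of "stop" with an unattested onset.
for candidate in ("stop", "tsop"):
    print(candidate,
          unigram_logprob(candidate, unigram),
          bigram_logprob(candidate, bigram))
```

Because the unigram model ignores phoneme order, it cannot tell a word from any of its anagrams, whereas the bigram model penalizes sequences such as word-initial "ts" that never occur in the toy lexicon. This is the kind of sequential cue the abstract says a unigram model overlooks.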