An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

  • Authors:
  • Michael R. Brent

  • Affiliations:
  • Department of Cognitive Science, Johns Hopkins University, Baltimore, MD 21218. brent@jhu.edu

  • Venue:
  • Machine Learning - Special issue on natural language learning
  • Year:
  • 1999


Abstract

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that our algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances.
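The core idea of scoring candidate word sequences and selecting the most probable one can be sketched with a dynamic program over possible segmentations. The sketch below is a deliberately simplified illustration, not Brent's actual model: it assumes a small, fixed lexicon with made-up word probabilities (Brent's algorithm instead assigns a prior to the lexicon itself and discovers it from the unsegmented text).

```python
import math

# Toy lexicon with invented probabilities (illustrative assumption only;
# the paper's model does not take a known lexicon as input).
LEXICON = {"the": 0.3, "dog": 0.2, "ate": 0.1, "thedo": 0.01, "gate": 0.05}

def segment(text):
    """Return the segmentation of `text` into lexicon words that
    maximizes the product of per-word probabilities, via dynamic
    programming over prefix lengths."""
    n = len(text)
    # best[i] = (log-probability, word list) for the best parse of text[:i]
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            word = text[j:i]
            if word in LEXICON and best[j][0] > -math.inf:
                score = best[j][0] + math.log(LEXICON[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

print(segment("thedogate"))  # -> ['the', 'dog', 'ate']
```

Here "thedogate" has two parses under the toy lexicon, and the dynamic program prefers the higher-probability sequence ['the', 'dog', 'ate'] over ['thedo', 'gate']; Brent's algorithm performs an analogous search, but over segmentations scored by the prior probability of the entire corpus.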