Several results in the word segmentation literature suggest that description length provides a useful estimate of segmentation quality in fully unsupervised settings. However, since the space of potential segmentations grows exponentially with the length of the corpus, no tractable algorithm follows directly from the Minimum Description Length (MDL) principle. Therefore, it is necessary to generate a set of candidate segmentations and select among them according to the MDL principle. We evaluate several algorithms for generating these candidate segmentations on a range of natural language corpora, and show that the Bootstrapped Voting Experts algorithm consistently outperforms other methods when paired with MDL.
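
The selection step described above can be illustrated with a minimal sketch. It assumes a simplified two-part code, in which the lexicon is spelled out character by character and the corpus is encoded as a sequence of word tokens, each under its own empirical distribution. This is an illustrative assumption, not necessarily the code-length formula used in the paper. The function names `description_length` and `select_by_mdl` are hypothetical, and the candidate segmentations are written by hand rather than produced by Bootstrapped Voting Experts or any other generator.

```python
import math
from collections import Counter


def _entropy_bits(counts):
    """Bits needed to encode all items in `counts` under their empirical distribution."""
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def description_length(segmentation):
    """Simplified two-part description length in bits (an assumed coding scheme).

    segmentation: list of word strings whose concatenation is the corpus.
    """
    word_counts = Counter(segmentation)

    # Lexicon cost: spell each distinct word once, followed by a
    # hypothetical end-of-word marker "#".
    char_counts = Counter()
    for word in word_counts:
        char_counts.update(word)
        char_counts["#"] += 1
    lexicon_bits = _entropy_bits(char_counts)

    # Corpus cost: encode the sequence of word tokens given the lexicon.
    corpus_bits = _entropy_bits(word_counts)

    return lexicon_bits + corpus_bits


def select_by_mdl(candidates):
    """Return the candidate segmentation with the smallest description length."""
    return min(candidates, key=description_length)


if __name__ == "__main__":
    text = "thedogthecatthedog"
    candidates = [
        list(text),                                   # every character is a word
        ["the", "dog", "the", "cat", "the", "dog"],   # word-level segmentation
        [text],                                       # the whole corpus is one word
    ]
    for cand in candidates:
        print(f"{description_length(cand):7.2f} bits  {cand}")
    print("selected:", select_by_mdl(candidates))
```

A string of n characters has 2^(n-1) possible segmentations, so scoring all of them quickly becomes infeasible. The sketch therefore scores only a small, externally supplied candidate set, which mirrors the generate-then-select strategy described in the abstract.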