Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies, respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.
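To make the task concrete, the following is a minimal sketch of segmenting an unbroken string under a fixed unigram word model using dynamic programming. It is not the Bayesian method proposed here: the function name `segment_unigram`, the hand-set `word_probs` table, and the per-character unknown-word penalty `unk` are all assumptions made for illustration, and the probabilities are given rather than learned.

```python
import math

def segment_unigram(text, word_probs, max_len=10, unk=1e-9):
    """Return the highest-scoring segmentation of `text` under a fixed
    unigram model: each word is scored independently of its neighbours.

    `word_probs` is a hypothetical, hand-set probability table; unseen
    substrings are penalised by `unk` per character (an assumption).
    """
    n = len(text)
    # best[i] = (best log-score of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs:
                log_p = math.log(word_probs[word])
            else:
                log_p = len(word) * math.log(unk)
            score = best[start][0] + log_p
            if score > best[end][0]:
                best[end] = (score, start)
    # Follow the back-pointers from the end of the string to recover words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy probabilities, chosen by hand for this example only.
toy_probs = {"the": 0.3, "dog": 0.2, "do": 0.1, "g": 0.01}
print(segment_unigram("thedog", toy_probs))
```

A bigram variant of this sketch would score each word conditioned on the word before it, so the dynamic program would need to track the previous word as part of its state.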