Many Asian languages are written without word delimiters, so NLP systems for them require word segmentation, and segmentation models must be re-trained quickly to adapt to new resources or domains. The traditional tokenization model is efficient but inaccurate. This paper proposes a phrase-based model that factors a sentence tokenization into phrase tokenizations while also accounting for the dependencies between phrases. The model recognizes out-of-vocabulary (OOV) words well, which significantly improves overall performance. Training consists of a linear-time phrase-extraction step followed by maximum-likelihood estimation (MLE), and decoding uses dynamic-programming algorithms.
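The dynamic-programming decoding mentioned above can be illustrated with a minimal sketch. This is not the paper's phrase-based model (which also models dependencies between phrases); it assumes a simpler unigram setting where `logprob` is a hypothetical dictionary of MLE-estimated log-probabilities for candidate phrases, with a fallback score for unseen single characters:

```python
import math

def segment(sentence, logprob, max_len=5):
    """Viterbi-style decoding: find the segmentation of `sentence` with the
    highest total log-probability under a unigram phrase model.

    `logprob` maps candidate phrases to log-probabilities; unseen phrases
    are disallowed except single characters, which receive a small fallback
    score (an assumption made for this sketch, not the paper's OOV model).
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i] = best score of sentence[:i]
    back = [0] * (n + 1)              # back[i] = start of the last phrase
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = sentence[j:i]
            score = logprob.get(piece)
            if score is None:
                if i - j == 1:
                    score = math.log(1e-6)  # fallback for unseen characters
                else:
                    continue
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # Recover the segmentation by following the back-pointers.
    out, i = [], n
    while i > 0:
        out.append(sentence[back[i]:i])
        i = back[i]
    return out[::-1]
```

Because each position looks back at most `max_len` characters, decoding runs in time linear in the sentence length, which is what makes this family of algorithms practical for fast re-training and deployment.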