Many Asian languages are written without word delimiters, so NLP systems for them require word segmentation, and segmentation models must be re-trained quickly to adapt to new resources or domains. The traditional tokenization model is efficient but inaccurate. This paper proposes a phrase-based model that factors a sentence tokenization into phrase tokenizations while also accounting for the dependencies between phrases. The model recognizes out-of-vocabulary (OOV) words well, which significantly improves overall performance. Training consists of a linear-time phrase-extraction step followed by maximum-likelihood estimation (MLE), and decoding uses dynamic-programming algorithms.
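The dynamic-programming decoding mentioned above can be illustrated with a minimal sketch. This is not the paper's phrase-based model (which also models dependencies between phrases); it assumes a simpler unigram setting where `logprob` is a hypothetical dictionary of MLE-estimated log-probabilities for candidate phrases, with a fallback score for unseen single characters:

```python
import math

def segment(sentence, logprob, max_len=5):
    """Viterbi-style decoding: find the segmentation of `sentence` with the
    highest total log-probability under a unigram phrase model.

    `logprob` maps candidate phrases to log-probabilities; unseen phrases
    are disallowed except single characters, which receive a small fallback
    score (an assumption made for this sketch, not the paper's OOV model).
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i] = best score of sentence[:i]
    back = [0] * (n + 1)              # back[i] = start of the last phrase
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = sentence[j:i]
            score = logprob.get(piece)
            if score is None:
                if i - j == 1:
                    score = math.log(1e-6)  # fallback for unseen characters
                else:
                    continue
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # Recover the segmentation by following the back-pointers.
    out, i = [], n
    while i > 0:
        out.append(sentence[back[i]:i])
        i = back[i]
    return out[::-1]
```

Because each position looks back at most `max_len` characters, decoding runs in time linear in the sentence length, which is what makes this family of algorithms practical for fast re-training and deployment.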