A discriminative latent variable Chinese segmenter with hybrid word/character information

Authors:
Xu Sun; Yaozhong Zhang; Takuya Matsuzaki; Yoshimasa Tsuruoka; Jun'ichi Tsujii
Affiliations:
University of Tokyo; University of Tokyo; University of Tokyo; University of Manchester; University of Tokyo, Japan and University of Manchester, UK
Venue:
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Year:
2009

Citing (7):

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Learning as search optimization: approximate large margin methods for structured prediction
ICML '05 Proceedings of the 22nd international conference on Machine learning

Accelerated training of conditional random fields with stochastic gradient methods
ICML '06 Proceedings of the 23rd international conference on Machine learning

Chinese segmentation and new word detection using conditional random fields
COLING '04 Proceedings of the 20th international conference on Computational Linguistics

Sequential labeling with latent variables: an exact inference algorithm and its efficient approximation
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics

A hybrid Markov/semi-Markov conditional random field for sequence segmentation
EMNLP '06 Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

Subword-based tagging by conditional random fields for Chinese word segmentation
NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

Cited by (5):

Word-based and character-based word segmentation models: comparison and combination
COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters

Enhancing Chinese word segmentation using unlabeled data
EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing

Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation
ACM Transactions on Asian Language Information Processing (TALIP)

Fast online training with frequency-adaptive learning rates for Chinese word segmentation and new word detection
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1

Probabilistic Chinese word segmentation with non-local information and stochastic training
Information Processing and Management: an International Journal

Abstract

Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative: a latent variable model that uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long-range dependencies and improve recall when segmenting long words, e.g., named entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best ones in the literature.
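
To make the character-based tagging formulation and the long-word recall measure concrete, here is a minimal Python sketch. It is an illustration only, not the authors' system: the B/M/E/S tag set, the function names, the `min_len=3` threshold for a "long" word, and the example sentence are all assumptions that do not appear in the abstract.

```python
def words_to_tags(words):
    """Map a segmented sentence to one tag per character.

    Assumed tag set (not from the paper): B = begin, M = middle,
    E = end of a multi-character word, S = single-character word.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their per-character tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # these tags close a word
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


def long_word_recall(gold, predicted, min_len=3):
    """Fraction of gold words with at least min_len characters whose
    exact boundaries also appear in the predicted segmentation.
    Returns None when the gold segmentation has no long words."""
    def spans(words):
        out, start = set(), 0
        for w in words:
            out.add((start, start + len(w)))
            start += len(w)
        return out

    gold_long = {s for s in spans(gold) if s[1] - s[0] >= min_len}
    if not gold_long:
        return None
    return len(gold_long & spans(predicted)) / len(gold_long)


# Hypothetical example sentence, chosen only for illustration.
gold = ["我", "爱", "北京", "天安门"]
print(words_to_tags(gold))
print(long_word_recall(gold, ["我", "爱", "北京", "天安", "门"]))
```

In this sketch a segmenter that splits the three-character word into two pieces gets no credit for it, which is the kind of long-word error the abstract says latent variables are meant to reduce.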