Probabilistic Chinese word segmentation with non-local information and stochastic training
Information Processing and Management: An International Journal
Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the task, incorporating features based on complete words. In this paper, we propose an alternative: a latent variable model that uses hybrid information based on both word sequences and character sequences. We argue that latent variables can help capture long-range dependencies and improve recall on long words, e.g., named entities. Experimental results confirm that this is the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best systems in the literature.
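To make the "character-based tagging" view concrete, here is a minimal sketch (not the paper's model) of the standard reduction: each character is assigned a positional tag (B/M/E/S), so segmentation becomes sequence labeling, the conventional CRF setup. The function names and the BMES tag set are illustrative conventions, not taken from this paper.

```python
def words_to_tags(words):
    """Convert a gold segmentation (list of words) to per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")                 # single-character word
        else:
            tags.append("B")                 # word-beginning character
            tags.extend("M" * (len(w) - 2))  # word-internal characters
            tags.append("E")                 # word-ending character
    return tags

def tags_to_words(chars, tags):
    """Recover a segmentation from characters and predicted tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):                  # a word ends here
            words.append(buf)
            buf = ""
    if buf:                                  # tolerate a truncated tag sequence
        words.append(buf)
    return words

# Example: "北京" (Beijing) + "大学" (university) + "生" (student)
words = ["北京", "大学", "生"]
tags = words_to_tags(words)                  # ['B', 'E', 'B', 'E', 'S']
assert tags_to_words("".join(words), tags) == words
```

In this formulation the model only sees local character windows, which is exactly the limitation (poor long-range dependencies, low recall on long words) that motivates word-level features in semi-Markov and latent variable models.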