This paper investigates how unlabeled data can improve the accuracy of supervised word segmentation. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once within a document. The novel features yield relative error reductions of 13.8% in F-score and 15.4% in OOV recall.
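
The document-level idea can be illustrated with a minimal sketch. The abstract does not specify the actual feature templates, so everything below is an assumption made for illustration: the function names `repeated_ngrams` and `doc_frequency_feature`, the n-gram length range, and the frequency buckets are hypothetical. The sketch counts character strings that recur within a single document and turns the count into a bucketed string feature that a character-based discriminative segmenter could consume.

```python
from collections import Counter


def repeated_ngrams(document: str, max_len: int = 4) -> Counter:
    """Count character n-grams (length 2..max_len) that occur more than once
    in a single document. Hypothetical helper, not the paper's definition."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(document) - n + 1):
            counts[document[i:i + n]] += 1
    # Keep only strings that recur; these are candidate repeated (possibly OOV) words.
    return Counter({gram: c for gram, c in counts.items() if c > 1})


def doc_frequency_feature(document: str, start: int, length: int,
                          repeated: Counter) -> str:
    """Return a bucketed in-document frequency feature for the n-gram that
    starts at `start`. The bucket boundaries are an assumption."""
    gram = document[start:start + length]
    count = repeated.get(gram, 0)
    if count == 0:
        bucket = "0"      # the string does not recur in this document
    elif count == 2:
        bucket = "2"
    else:
        bucket = "3+"
    return f"docfreq[{length}]={bucket}"


if __name__ == "__main__":
    doc = "abcabcxab"  # toy stand-in for an unsegmented document
    rep = repeated_ngrams(doc)
    print(doc_frequency_feature(doc, 0, 2, rep))
    print(doc_frequency_feature(doc, 2, 2, rep))
```

Because the counts come only from the document being segmented, features of this kind are available at test time without any labels, which is what makes the setting transductive.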